The Unicode Standard Version 11.0 Core Specification

Similar documents
Introduction 1. Chapter 1

The Unicode Standard Version 12.0 Core Specification

The Unicode Standard Version 10.0 Core Specification

The Unicode Standard Version 11.0 Core Specification

The Unicode Standard Version 10.0 Core Specification

The Unicode Standard Version 6.1 Core Specification

The Unicode Standard Version 12.0 Core Specification

UNICODE IDEOGRAPHIC VARIATION DATABASE

2011 Martin v. Löwis. Data-centric XML. Character Sets

This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley.

2007 Martin v. Löwis. Data-centric XML. Character Sets

Recent Trends in Standardization of Japanese Character Codes

ATypI Hongkong Development of a Pan-CJK Font

Proposed Update. Unicode Standard Annex #11

ä + ñ ISO/IEC JTC1/SC2/WG2 N

Proposed Update Unicode Standard Annex #11 EAST ASIAN WIDTH

Unicode character. Unicode JIS X 0213 GB *2. Unicode character *3. John Mauchly Short Order Code character. Unicode Unicode ASCII.

This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley.

General Structure 2. Chapter Architectural Context

Two distinct code points: DECIMAL SEPARATOR and FULL STOP

The process of preparing an application to support more than one language and data format is called internationalization. Localization is the process

The Unicode Standard Version 11.0 Core Specification

The Unicode Standard Version 6.2 Core Specification

Proposed Update Unicode Standard Annex #34

The Use of Unicode in MARC 21 Records. What is MARC?

UTF and Turkish. İstinye University. Representing Text

AFP Support for TrueType/Open Type Fonts and Unicode

The Unicode Standard Version 9.0 Core Specification

Representing Characters and Text

Representing Characters, Strings and Text

The Unicode Standard Version 11.0 Core Specification

Designing & Developing Pan-CJK Fonts for Today

Using Unicode with MIME

Title: Application to include Arabic alphabet shapes to Arabic 0600 Unicode character set

Unicode definition list

Chapter 4: Computer Codes. In this chapter you will learn about:

1. Introduction 2. TAMIL DIGIT ZERO JTC1/SC2/WG2 N Character proposed in this document About INFITT and INFITT WG

1. Introduction 2. TAMIL LETTER SHA Character proposed in this document About INFITT and INFITT WG

ISO/IEC JTC 1/SC 2 N 3332/WG2 N 2057

The Power of Plain Text & the Importance of Meaningful Content Dr. Ken Lunde Senior Computer Scientist Adobe Systems Incorporated

Multilingual mathematical e-document processing

****This proposal has not been submitted**** ***This document is displayed for initial feedback only*** ***This proposal is currently incomplete***

Conformance 3. Chapter Versions of the Unicode Standard

Compound Text Encoding

ISO INTERNATIONAL STANDARD. Information and documentation Transliteration of Devanagari and related Indic scripts into Latin characters

Computer Science Applications to Cultural Heritage. Introduction to computer systems

INTERNATIONAL STANDARD. This is a preview - click here to buy the full publication

Flerspråkiga delmängder i ISO/IEC

Form number: N2352-F (Original ; Revised , , , , , , ) N2352-F Page 1 of 7

Proposal to Encode Oriya Fraction Signs in ISO/IEC 10646

The Adobe-CNS1-6 Character Collection

Proposal to Add Four SENĆOŦEN Latin Charaters

Information technology Keyboard layouts for text and office systems. Part 9: Multi-lingual, multiscript keyboard layouts

ISO/IEC INTERNATIONAL STANDARD

Part III: Survey of Internet technologies

Internet Engineering Task Force (IETF) Category: Standards Track March 2015 ISSN:

This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley.

A B Ɓ C D Ɗ Dz E F G H I J K Ƙ L M N Ɲ Ŋ O P R S T Ts Ts' U W X Y Z

Character Encodings. Fabian M. Suchanek

TECkit version 2.0 A Text Encoding Conversion toolkit

ISO/IEC JTC 1/SC 2/WG 2 N3086 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS 1

INTERNATIONAL STANDARD

ISO/IEC TR TECHNICAL REPORT. Information technology Guidelines for the preparation of programming language standards

ISO/IEC JTC 1/SC 2/WG 2 Proposal summary form N2652-F accompanies this document.

The Unicode Standard Version 11.0 Core Specification

Request for Comments: 2482 Category: Informational Spyglass January Language Tagging in Unicode Plain Text. Status of this Memo

Response to the revised "Final proposal for encoding the Phoenician script in the UCS" (L2/04-141R2)

ISO/IEC AMENDMENT

Introduction. Acknowledgements

Network Working Group Request for Comments: Category: Best Current Practice January IANA Charset Registration Procedures

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE

Comments on responses to objections provided in N2661

This manual describes utf8gen, a utility for converting Unicode hexadecimal code points into UTF-8 as printable characters for immediate viewing and

General Structure 2. Chapter Architectural Context

Legacy Gaiji Solutions & SING

This proposal is limited to the addition and rearrangement of some of the Korean character part of ISO/IEC (UCS2).

Proposal to Encode the Ganda Currency Mark for Bengali in the BMP of the UCS

ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC

Unicode: What is it and how do I use it?

JiveX Enterprise PACS Solutions. JiveX DICOM Worklist Broker Conformance Statement - DICOM. Version: As of

The Unicode Standard Version 7.0 Core Specification

ISO/IEC JTC1/SC2/WG2. Universal Multiple-Octet Coded Character Set (UCS) - ISO/IEC Secretariat: ANSI

Universal Acceptance Technical Perspective. Universal Acceptance

The Unicode Standard Version 10.0 Core Specification

Category: Informational 1 April 2005

The Unicode Standard. Unicode Summary description. Unicode character database (UCD)

Unicode Support. Chapter 2:

Picsel epage. PowerPoint file format support

Cindex 3.0 for Windows. Release Notes

Reply to L2/10-327: Comments on L2/10-280, Proposal to Add Variation Sequences... 1

ISO/IEC JTC1/SC2/WG2 N2641

ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS. Yes (or) More information will be provided later:

Network Working Group. Category: Informational July 1995

Ꞑ A790 LATIN CAPITAL LETTER A WITH SPIRITUS LENIS ꞑ A791 LATIN SMALL LETTER A WITH SPIRITUS LENIS

ISO/IEC INTERNATIONAL STANDARD. Information technology Coding of audio-visual objects Part 18: Font compression and streaming

Request for Comments: July 1994

John H. Jenkins If available now, identify source(s) for the font (include address, , ftp-site, etc.) and indicate the tools used:

This document is to be used together with N2285 and N2281.

Transliteration of Tamil and Other Indic Scripts. Ram Viswanadha Unicode Software Engineer IBM Globalization Center of Competency, California, USA

ISO/IEC JTC/1 SC/2 WG/2 N2095

Transcription:

The Unicode Standard Version 11.0 Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. 2018 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. Version 11.0. Includes index. ISBN 978-1-936213-19-1 (http://www.unicode.org/versions/unicode11.0.0/) 1. Unicode (Computer character set) I. Unicode Consortium. QA268.U545 2018 ISBN 978-1-936213-19-1 Published in Mountain View, CA June 2018

1 Chapter 1 Introduction 1 The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. As the default encoding of HTML and XML, the Unicode Standard provides the underpinning for the World Wide Web and the global business environments of today. Required in new Internet protocols and implemented in all modern operating systems and computer languages such as Java and C#, Unicode is the basis of software that must function all around the world. With Unicode, the information technology industry has replaced proliferating character sets with data stability, global interoperability and data interchange, simplified software, and reduced development costs. While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII s limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world more than 1 million characters can be encoded. No escape sequence or control code is required to specify any character in any language. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility (see Figure 1-1). The Unicode Standard specifies a numeric value (code point) and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward. In addition to character codes and names, other information is crucial to ensure legible text: a character s case, directionality, and alphabetic properties must be well defined. The Unicode Standard defines these and other semantic values, and it includes application data such as case mapping tables and character property tables as part of the Unicode Character Database. Character properties define a character s identity and behavior; they ensure consistency in the processing and interchange of Unicode data. See Section 4.1, Unicode Character Database. Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF- 32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems. The Unicode Standard is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/ IEC 10646. The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters. The majority of the common characters used in the major languages of the world are encoded in the first 65,536 code points, also known as the Basic

Introduction 2 Figure 1-1. Wide ASCII ASCII/8859-1 Text Unicode Text A S C I I / 8 8 5 9-1 0100 0001 0101 0011 0100 0011 0100 1001 0100 1001 0010 1111 0011 1000 0011 1000 0011 0101 0011 1001 0010 1101 0011 0001 A S C I I 0000 0000 0100 0001 0000 0000 0101 0011 0000 0000 0100 0011 0000 0000 0100 1001 0000 0000 0100 1001 0000 0000 0010 0000 0101 1001 0010 1001 0101 0111 0011 0000 0000 0000 0010 0000 0000 0110 0011 0011 0000 0110 0100 0100 0000 0110 0010 0111 t e x t 0010 0000 0111 0100 0110 0101 0111 1000 0111 0100 0000 0110 0100 0101 0000 0000 0010 0000 0000 0011 1011 0001 0010 0010 0111 0000 0000 0011 1011 0011 Multilingual Plane (BMP). The overall capacity for more than 1 million characters is more than sufficient for all known character encoding requirements, including full coverage of all minority and historic scripts of the world.

Introduction 3 1.1 Coverage 1.1 Coverage The Unicode Standard, Version 11.0, contains 137,374 characters from the world s scripts. These characters are more than sufficient not only for modern communication for the world s languages, but also to represent the classical forms of many languages. The standard includes the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia and Africa. Many archaic and historic scripts are encoded. The Han script includes 87,887 unified ideographic characters defined by national, international, and industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore. In addition, the Unicode Standard contains many important symbol sets, including currency symbols, punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats, and emoji. For overall character and code range information, see Chapter 2, General Structure. Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel, or private-use characters, nor does it encode logos or graphics. Graphologies unrelated to text, such as dance notations, are likewise outside the scope of the Unicode Standard. Font variants are explicitly not encoded. The Unicode Standard reserves 6,400 code points in the BMP for private use, which may be used to assign codes to characters not included in the repertoire of the Unicode Standard. Another 131,068 private-use code points are available outside the BMP, should 6,400 prove insufficient for particular applications. Standards Coverage The Unicode Standard is a superset of all characters in widespread use today. It contains the characters from major international and national standards as well as prominent industry character sets. For example, Unicode incorporates the ISO/IEC 6937 and ISO/ IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426. Important national standards contained within Unicode include ANSI Z39.64, KS X 1001, JIS X 0208, JIS X 0212, JIS X 0213, GB 2312, GB 18030, HKSCS, and CNS 11643. Industry code pages and character sets from Adobe, Apple, Fujitsu, Hewlett-Packard, IBM, Lotus, Microsoft, NEC, and Xerox are fully represented as well. The Unicode Standard is fully conformant with the International Standard ISO/IEC 10646:2017, Information Technology Universal Coded Character Set (UCS), known as the Universal Character Set (UCS). For more information, see Appendix C, Relationship to ISO/IEC 10646. New Characters The Unicode Standard continues to respond to new and changing industry demands by encoding important new characters. As the universal character encoding, the Unicode Standard also responds to scholarly needs. To preserve world cultural heritage, important archaic scripts are encoded as consensus about the encoding is developed.

Introduction 4 1.2 Design Goals 1.2 Design Goals The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. The pre-existing legacy character encodings were both inconsistent and incomplete two encodings could use the same codes for two different characters and use different codes for the same characters, while none of the encodings handled any more than a small fraction of the world s languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs often were written only to support particular encodings, making development of international versions expensive. As a result, developing countries were particularly hard-hit, as it was not economically feasible to adapt specific versions of programs for smaller markets. Technical fields such as mathematics were also disadvantaged, because they were forced to use special fonts to represent arbitrary characters, often leading to garbled content. The designers of the Unicode Standard envisioned a uniform method of character identification that would be more efficient and flexible than previous encoding systems. The new system would satisfy the needs of technical and multilingual computing and would encode a broad range of characters for all purposes, including worldwide publication. The Unicode Standard was designed to be: Universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets. Efficient. Plain text is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text. Unambiguous. Any given Unicode code point always represents the same character. Figure 1-2 demonstrates some of these features, contrasting the Unicode encoding with mixtures of single-byte character sets with escape sequences to shift the meanings of bytes in the ISO/IEC 2022 framework using multiple character encoding standards.

Introduction 5 1.2 Design Goals Figure 1-2. Unicode Compared to the 2022 Framework Unicode A 0041 å 00E5 0645 03B5 0131 65E5 or 2022 + 8859 + JIS A 41 å E5 G 2D 47 F 2D 46 C 2D 43 2D $ 24 M 4D B 42 å E5 å E5 1 B9 ý FD F 46 7C

Introduction 6 1.3 Text Handling 1.3 Text Handling The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide. The specifications give programmers extensive descriptions and a vast amount of data about the handling of text, including how to: divide words and break lines sort text in different languages format numbers, dates, times, and other elements appropriate to different locales display text for languages whose written form flows from right to left, such as Arabic or Hebrew display text in which the written form splits, combines, and reorders, such as for the languages of South Asia deal with security concerns regarding the many look-alike characters from writing systems around the world Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible. With the Unicode Standard as the foundation of text representation, all of the text on the Web can be stored, searched, and matched with the same program code. Characters and Glyphs The difference between identifying a character and rendering it on screen or paper is crucial to understanding the Unicode Standard s role in text processing. The character identified by a Unicode code point is an abstract entity, such as latin capital letter a or bengali digit five. The mark made on screen or paper, called a glyph, is a visual representation of the character. The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. Ultimately, the software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the precise shape, size, or orientation of on-screen characters. Text Elements The successful encoding, processing, and interpretation of text requires appropriate definition of useful elements of text and the basic rules for interpreting text. The definition of text elements often changes depending on the process that handles the text. For example, when searching for a particular word or character written with the Latin script, one often wishes to ignore differences of case. However, correct spelling within a document requires case sensitivity.

Introduction 7 1.3 Text Handling The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF 16, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters.

Introduction 8 1.3 Text Handling