Unicode and Standardized Notation. Anthony Aristar

Data Management and Archiving, University of California at Santa Barbara, June 24-27, 2008. Unicode and Standardized Notation. Anthony Aristar

Once upon a time
There were people who decided to invent computers. And they thought: What are all the characters anyone could possibly need? We'll code these. It was even perfect, in that you could use 7 bits to code these characters. The code set, which was called ASCII, was able to represent every printable character using a number between 32 and 126. Space was 32, the letter "A" was 65, and so on. Codes below 32 (plus 127, delete) were used for control characters, and if you turned the eighth bit on, you could use codes 128-255 for whatever you liked. And everyone lived happily ever after.
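As a minimal sketch of the numbering described above, Python's built-in `ord` returns the code for a character. The sample string and the variable names are only illustrative.

```python
# ASCII assigns each character a number that fits in 7 bits.
for ch in (" ", "A", "a", "~"):
    code = ord(ch)
    # Show the character, its decimal code, and the 7-bit pattern.
    print(repr(ch), code, format(code, "07b"))

# Every ASCII code is below 128, so the eighth bit is never needed.
ascii_text = "Happily ever after"
ascii_codes = [ord(ch) for ch in ascii_text]
all_seven_bit = all(code < 128 for code in ascii_codes)
print(all_seven_bit)
```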

Pesky Europeans
Into this perfect world intruded all these odd people who don't speak English. Most of them don't matter of course, but there are a few, like Germans, French, Spanish, Italians, Russians, Swedes and so on, who have lots of money or power and complained.

The fix was in
After grumbling about how picky these Europeans were, the computer people had a bright idea. What about the characters above 127? Why not just use some of these for the few characters most of these guys use? And in languages like Russian, you could use the ASCII codes that would usually give you B or P, but instead you'd see Б and П.

Code Pages
Eventually this free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 on up, depending on where you lived. These different systems were called code pages.

Rendering
The idea was that you could take a code like, say, 123, and it could appear as any one of these on different code pages: ä æ é à. In short, you use the same code point, but you'd render (display) it in a different way when you chose different code pages. Everyone was now happy again.
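To see the same effect today, here is a small sketch that decodes one byte under three code pages. The byte value 228 and the three code pages are my own choices for illustration, not the ones on the slide. The codec names are the ones Python ships with.

```python
# One byte above the ASCII range: value 228 (hex E4).
raw = bytes([0xE4])

# The same stored number, interpreted under three different code pages.
code_pages = ("cp1252", "cp1251", "iso8859_7")  # Western, Cyrillic, Greek
renderings = {name: raw.decode(name) for name in code_pages}

for name, ch in renderings.items():
    print(name, ch)
```

The stored byte never changes. Only the table used to interpret it does.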

Problem
But how do you access these different code pages? Even with this ANSI standard, you couldn't get more than a single character set on one computer, for the operating system had to choose which code page rendering to use. The solution: give the operating system access to all the code pages, and let the font being used decide which rendering should be used in each case.

What did Fonts do?
Thus, when you wanted to change to a different character set, you simply switched the font. The font would encode each character with the appropriate code point from the appropriate code page, and display it with the appropriate character shape. Problem: are fonts, character sets and code pages the same?

So
Well, no. You can see the results of this confusion by considering the following. Here's a short sentence in Russian: Если Украина вступит в НАТО, между ней

Great, isn't it?
But if a colleague in Russia sent this to you, you would see this text: Esli Ukraina vstupit v NATO, me`du nei. Exactly the same code points from the right code page are there. But since he has a Cyrillic font on his machine, and you don't, the characters just aren't being rendered as they should be.

Meanwhile, in the east
That's one problem. But there's another. 256 possible characters sounds like a lot. But what about Chinese? Japanese? There's no way you can fit thousands of characters into 8 bits. Solution: DBCS, or "double byte character set", in which some letters are stored in one byte and others take two.

The joys of DBCS
Now you have another problem. A typical mail header in DBCS: Subject: ¹ÚÆ ¹ÀÌ³Ê µå  Çコ Å ÅÁø NO.4. At least, that is what you see if you don't happen to have a DBCS interpreter on your machine.
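The mixed one-byte and two-byte storage, and the garbage you get without the right interpreter, can be reproduced in a few lines. This is a sketch that assumes EUC-KR as the double-byte character set. The slide does not say which one produced its header, and the sample text is invented.

```python
# A short string mixing Korean syllables with ASCII characters.
text = "\ud55c\uae00 A"  # two Hangul syllables, a space, and the letter A

# In EUC-KR each Hangul syllable takes two bytes and each ASCII character one.
data = text.encode("euc_kr")
print(len(text), "characters ->", len(data), "bytes")

# A reader without a DBCS interpreter treats every byte as its own character.
wrong = data.decode("latin-1")
print(wrong)

# A reader with the right interpreter gets the original text back.
right = data.decode("euc_kr")
print(right == text)
```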

And then Unicode!
What about this as an idea? Every character in every language has a code point. No code point is ever reused. Every code point has a unique rendering or set of renderings, all of which are seen as equivalent. That way, so long as you have a font coded for this system, no matter what someone else may be using, the character will always show up correctly. In short, you can use any font for this system, just so long as it's built for it.

But how to encode this?
Obviously, one 8-bit byte won't do it for such a system. We need tens of thousands of code points. So use more than one byte. But how many? Well, you could use two 8-bit bytes. That will give you 65,536 possible unique characters. That will deal with most, though not quite all, of the characters used in human language.

Unicode continued
So is Unicode a character encoding system that uses two bytes per character? No. And to prove it, take a text, write it in ASCII, drop it into a word processor, and turn it into Unicode. Is the text still interpretable? Well, sometimes it is, and sometimes it isn't. It seems to depend upon what kind of Unicode you turn it into.

Contradiction?
If each Unicode character is unique, what is going on? How come sometimes ASCII seems to come through fine, while at other times it doesn't? This seems to imply that a character can have more than one Unicode encoding. Is there a contradiction here?

Code points revisited
No, it's not a contradiction, for Unicode does something very different from previous character systems. It distinguishes code points from encodings. All Unicode characters have a unique code point. But these can be encoded in different ways.

Encodings
For example, the character e has the code point U+0065. That's it: e can never have any other code point. This can be encoded in two bytes as 00 65. This is called UTF-16, since it stores characters in 16-bit units. But it can also be encoded as UTF-8.
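The split between code point and encoding is easy to check for yourself. A minimal sketch using Python's built-in codecs follows. The big-endian variant `utf-16-be` is chosen so that the bytes come out as 00 65 with no byte-order mark in front.

```python
ch = "e"

# The code point is a property of the character and never changes.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# The same code point stored under two different encodings.
utf16 = ch.encode("utf-16-be")  # two bytes: 00 65
utf8 = ch.encode("utf-8")       # one byte: 65
print(utf16.hex(" "), "|", utf8.hex(" "))

# Decoding either byte sequence gives back the same character.
same_character = utf16.decode("utf-16-be") == utf8.decode("utf-8") == ch
print(same_character)
```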

The Solution: UTF-8
If UTF-16 uses 16 bits to store a character, does this mean that UTF-8 uses 8 bits? Well, partly. The way it works is: every code point from 0-127 is stored in one byte (same as ASCII), and every code point above 127 is stored in two, three, or four bytes. The result: every ASCII file looks exactly the same in UTF-8 as it does in a system using 256 code points. Because this ensures that old files will be readable, almost all Unicode web sites use UTF-8.
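The variable byte count is simple to observe. In this sketch the four sample characters are my own picks, one for each possible length.

```python
# One sample character for each UTF-8 length.
samples = {
    "A": "\u0041",               # ASCII letter
    "e-acute": "\u00e9",         # Latin letter with accent
    "euro sign": "\u20ac",       # currency symbol
    "emoji": "\U0001F600",       # character beyond the first 65,536 code points
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
for name, length in utf8_lengths.items():
    print(name, length)

# Pure ASCII text produces identical bytes under ASCII and under UTF-8.
old_file = "Plain old ASCII text"
ascii_compatible = old_file.encode("ascii") == old_file.encode("utf-8")
print(ascii_compatible)
```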

Other Unicode Encodings
UTF-7, which keeps the high (eighth) bit always zero so that text survives 7-bit channels (rarely used, and then usually in archaic mail programs). UCS-4 and UTF-32, which store every code point in a fixed 4 bytes. This is simple but uses a mammoth amount of memory, so it is rare for files and web pages. It is not needed for coverage, either: UTF-8 and UTF-16 can also encode every Unicode character.
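A short sketch comparing the encodings on one character outside the first 65,536 code points. The character U+1F600 is an arbitrary choice for illustration. Note that UTF-16 needs two 16-bit units (a surrogate pair) for it.

```python
ch = "\U0001F600"  # a code point above U+FFFF

encoded = {
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

for name, data in encoded.items():
    # Print the codec, the number of bytes, and the bytes in hex.
    print(name, len(data), data.hex(" "))

# All three encodings round-trip to the same single code point.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())
print(round_trips)
```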

Isn't all this variety wonderful?
Well, not quite. It means that even if you're using Unicode, you have to say which Unicode encoding you're using. That's why every web page should have something like this in its header: <meta http-equiv="content-type" content="text/html; charset=utf-8"> And this has to be the very first thing in the header because this is what tells the web browser how to read the rest of the file. If you don't do this, the browser is going to guess, and it often gets it wrong.
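Here is roughly what a wrong guess looks like. The sketch assumes a page saved as UTF-8 and a reader that guesses the Windows-1252 code page. The word used is invented for the example.

```python
# Bytes of a page that was saved as UTF-8 but does not declare it.
page_bytes = "caf\u00e9".encode("utf-8")

# A reader that is told (or correctly guesses) the encoding.
right = page_bytes.decode("utf-8")

# A reader that guesses a single-byte Western code page instead.
wrong = page_bytes.decode("cp1252")

print(right)
print(wrong)
```

The accented letter was stored as two bytes, so the wrong guess shows two unrelated characters in its place.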

Rendering again
We mentioned rendering before. In old-style non-Unicode fonts, the distinction between character and rendering meant nothing. Rendering was used to get both different forms of a character and different characters. The issue of glyph was completely ignored. In Unicode, character, glyph and rendering must all be distinguished.

The difference between Character and Glyph

The difference between glyph and character on the one hand and rendering on the other
[Figure: one character, many glyphs, rendered appropriately]

Ordering

Just one problem to be wary of: Combining characters
In Unicode, accent marks can be represented with their own code point values: é = U+0065 (e) + U+0301 (accent). But common combinations of letters and accents are also given their own code points for convenience: é = U+00E9.
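The practical danger is that the two spellings look identical but do not compare as equal. A minimal sketch using the standard library's `unicodedata.normalize` shows one common way to reconcile them before comparing.

```python
import unicodedata

decomposed = "e\u0301"   # e followed by the combining acute accent
precomposed = "\u00e9"   # the single precomposed code point

# The two strings display the same but are different sequences.
naive_equal = decomposed == precomposed
print(naive_equal, len(decomposed), len(precomposed))

# Normalizing both to the same form makes them comparable.
composed_equal = unicodedata.normalize("NFC", decomposed) == precomposed
split_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(composed_equal, split_equal)
```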

In summary: Character semantics
The Unicode standard includes an extensive database that specifies a large number of character properties, including:
Name
Type (e.g., letter, digit, punctuation mark)
Decomposition
Case and case mappings (for cased letters)
Numeric value (for digits and numerals)
Combining class (for combining characters)
Directionality
Line-breaking behavior
Cursive joining behavior
For Chinese characters, mappings to various other standards
and many other properties.
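Several of the properties listed above can be looked up from Python's standard `unicodedata` module, which is built from that database. This sketch covers only a handful of them, and the sample characters are my own picks.

```python
import unicodedata

# Name and type (general category) of a letter.
name_e = unicodedata.name("e")
category_e = unicodedata.category("e")          # "Ll" = lowercase letter

# Decomposition of the precomposed e with acute accent.
decomposition = unicodedata.decomposition("\u00e9")

# Case mapping.
upper_e = "e".upper()

# Numeric value of the fraction one half.
half = unicodedata.numeric("\u00bd")

# Combining class of the combining acute accent.
accent_class = unicodedata.combining("\u0301")

# Directionality of the Hebrew letter alef ("R" = right to left).
alef_direction = unicodedata.bidirectional("\u05d0")

print(name_e, category_e, decomposition, upper_e, half, accent_class, alef_direction)
```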

The General Scripts Area
[Chart: code point rows 00/01 through 1E/1F, covering Latin, IPA, Diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Canadian Aboriginal Syllabics, Ogham, Runic, Philippine, Khmer, Mongolian, Latin, Greek, Cherokee]

Unicode Coverage
European scripts: Latin, Greek, Cyrillic, Armenian, Georgian, IPA
Bidirectional (Middle Eastern) scripts: Hebrew, Arabic, Syriac, Thaana
Indic (Indian and Southeast Asian) scripts: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
East Asian scripts: Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
Other modern scripts: Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
Historical scripts: Runic, Ogham, Old Italic, Gothic, Deseret
Punctuation and symbols: Numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, Braille, musical notation, etc.

Markup and Annotation
This seems like a very simple issue. How do we mark up linguistic material? Does it matter, after all, what we call grammatical categories in something like the following, so long as we are clear?
Afisi   a-na-ph-a         nsomba
hyenas  SP-PST-kill-ASP   fish
'The hyenas killed the fish.'
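To make concrete what a computer has to work with, here is a sketch that pairs each morpheme in the example with its gloss label. It assumes the hyphen-delimited, word-aligned layout of the example above and nothing more. The variable names are invented.

```python
# The two aligned tiers of the interlinear example.
morpheme_tier = "Afisi a-na-ph-a nsomba"
gloss_tier = "hyenas SP-PST-kill-ASP fish"

aligned = []
for word, gloss in zip(morpheme_tier.split(), gloss_tier.split()):
    # Split each word and its gloss on hyphens and pair them up.
    aligned.append(list(zip(word.split("-"), gloss.split("-"))))

for pairs in aligned:
    print(pairs)

# Collect the grammatical category labels (written in capitals).
labels = [g for pairs in aligned for _, g in pairs if g.isupper()]
print(labels)
```

The program can pair labels with morphemes, but it has no idea what SP or ASP mean. That is the problem the following slides address.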

Comparing annotations
The problem is that though something like that might be clear to a human being, it's not at all clear to a computer. So how do we use computers to compare data across markup schemes, when everyone uses the annotation they like?

The Solution: Standardized Markup
The solution is to standardize markup in some way. So what do we do? The answer: use an interlanguage for translation among markups.

A number of Interlanguages
Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossingrules.php), developed by Bernard Comrie, Martin Haspelmath, and Balthasar Bickel at Leipzig. They consist of ten rules for the "syntax" and "semantics" of interlinear glosses, and an appendix with a proposed "lexicon" of abbreviated category labels. If used consistently, these can be used to produce interoperability between data sets.

But
Problem: the rules require that everyone do exactly the same thing, and use exactly the same terms.

What we need
A system that allows us to compare linguistic data sets which use different markup definitions. But also: methods are needed by which the semantics as well as the syntax of each tagging scheme can be compared. For example, both A and B use absolutive. Do they mean the same thing? A uses possessive where B uses genitive. How do they compare?

Solutions
Two means exist for doing this: DatCats and ontologies.

DAT-CATs
DatCats: ISO-TC37/SC4 Data Category Registry (http://www.localisation.ie/resources/presentations/2003/Conference/Presentations/Laurent%20Romary.ppt). Examples of values: ablativecase or dativecase. Since you can link your own terms to these DatCats, you can use them while still using your own terms. DatCats are relatively unstructured. They are essentially a set of terms to which things can be linked. (But more structure seems to be on the way.)
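The linking idea can be sketched with plain dictionaries. Everything here is hypothetical: the two projects, their tag abbreviations, and the `genitivecase` value are invented for illustration, and this is not the registry's actual interface. Only `ablativecase` and `dativecase` come from the slide.

```python
# Shared category values (the last one is an assumption for the example).
SHARED_CATEGORIES = {"ablativecase", "dativecase", "genitivecase"}

# Each project keeps its own labels and links them to shared categories.
project_a = {"ABL": "ablativecase", "GEN": "genitivecase"}
project_b = {"Abl.": "ablativecase", "DAT": "dativecase", "POSS": "genitivecase"}


def same_category(tag_a: str, tag_b: str) -> bool:
    """Return True if both local tags are linked to the same shared category."""
    cat_a = project_a.get(tag_a)
    cat_b = project_b.get(tag_b)
    return cat_a is not None and cat_a == cat_b


# Sanity check: every link points at a known shared category.
links_valid = all(
    cat in SHARED_CATEGORIES
    for mapping in (project_a, project_b)
    for cat in mapping.values()
)

print(same_category("ABL", "Abl."))   # different labels, same category
print(same_category("GEN", "DAT"))    # different categories
print(links_valid)
```

Whether a link such as POSS to a genitive category is linguistically right is a decision for each project. The flat mapping only records that decision.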

Or you can use an Ontology
An ontology is most similar to the DatCat system, but has more structure, in the form of hierarchy. It defines, in computer terms, which terms are the same, how they are equivalent, and how terms can be compared. It defines which terms subsume others. Just as with DatCats, you can allow everyone to use their own terminology (no one has to change) but still allow machines to compare linguistic material. Example: GOLD (the General Ontology for Linguistic Description): http://www.linguistics-ontology.org/

In short, you can have a system like this
[Diagram: an end user GUI sends a sample query, "ergative P2", to a query engine. The query engine draws on a knowledge base of terminology, including the ontology, and on XML data for Hopi, Nahuatl, and Yaqui. It returns examples of ergative constructions from P2 languages.]

GOLD: A linguistic Ontology
Contents: a knowledge base of linguistic terminology. The current version concentrates specifically on morphosyntactic terminology.
Structure: inheritance (is-a), multiple inheritance, mereological (part-whole) relations, and other relations.
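The is-a structure and multiple inheritance can be sketched as a small graph with a subsumption test. The term names and the hierarchy below are invented for illustration and are not taken from GOLD itself.

```python
# Hypothetical is-a links: each term maps to its direct parents.
IS_A = {
    "ErgativeCase": ["Case"],
    "AbsolutiveCase": ["Case"],
    "GenitiveCase": ["Case", "PossessionMarker"],  # multiple inheritance
    "Case": ["MorphosyntacticFeature"],
    "PossessionMarker": ["MorphosyntacticFeature"],
    "MorphosyntacticFeature": [],
}


def ancestors(term: str) -> set:
    """Collect every term reachable by following is-a links upward."""
    found = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(IS_A.get(parent, []))
    return found


def subsumes(general: str, specific: str) -> bool:
    """Return True if `general` is `specific` or one of its ancestors."""
    return general == specific or general in ancestors(specific)


print(subsumes("Case", "ErgativeCase"))
print(subsumes("PossessionMarker", "GenitiveCase"))
print(subsumes("ErgativeCase", "AbsolutiveCase"))
```

A query for a general term can then be expanded to every term it subsumes, which is one way a query engine could return data tagged with different but related labels.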