Blending Content for South Asian Language Pedagogy Part 2: South Asian Languages on the Internet

Similar documents
Proposals For Devanagari, Gurmukhi, And Gujarati Scripts Root Zone Label Generation Rules

Transliteration of Tamil and Other Indic Scripts. Ram Viswanadha Unicode Software Engineer IBM Globalization Center of Competency, California, USA

Proposal on Handling Reph in Gurmukhi and Telugu Scripts

Joiners (ZWJ/ZWNJ) with Semantic content for words in Indian subcontinent languages

The right hehs for Arabic script orthographies of Sorani Kurdish and Uighur

Domain Names in Pakistani Languages. IDNs for Pakistani Languages

Rendering in Dzongkha

ISO INTERNATIONAL STANDARD. Information and documentation Transliteration of Devanagari and related Indic scripts into Latin characters

(URW) ++ UNICODE APERÇU 1. Nimbus Sans Block Name. Regular. Bold. Light Vers Regular. Regular. Bold. Medium. Vers Vers Vers. 4.

Issues in Indic Language Collation

Issues in Indic Language Collation

REMARKS ON THE USE OF ZWJ & ZWNJ IN THE BRAHMI AND PERSO- ARABIC FAMILIES

Proposal for changes to ArabicShaping.txt to allow machine generation of Arabic fonts and glyphs. A. Generating Arabic glyphs from the Schematic Name

Extensible Rendering for Complex Writing Systems

INTERNATIONALIZATION IN GVIM

The Unicode Standard Version 10.0 Core Specification

pdfcalligraph an itext 7 add-on pdfcalligraph

e - HAND BOOK OF DOMESTIC MEDICINE AND COMMON AYURVEDIC REMEDIES User Manual

OpenType Font by Harsha Wijayawardhana UCSC

Internationalized Domain Names Variant Issues Project

ISO/IEC JTC/1 SC/2 WG/2 N2312. ISO/IEC JTC/1 SC/2 WG/2 Universal Multiple-Octet Coded Character Set (UCS)

ஒர ங க ற ததத ற றம ம தகத ட பத ட ம ம னவர ரமணஶர மத இந த யவ யல/ததத ழ லந ட ப ஆய வத ளர தம ழ நத ட

ISO/IEC JTC 1/SC 2/WG 2 Proposal summary form N2652-F accompanies this document.

COSC 243 (Computer Architecture)

Chapter 11 : Computer Science. Information Representation. Class XI ( As per CBSE Board) New Syllabus

Center for Language Engineering Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore

FLT: Font Layout Table

Indic NLP Library Documentation

Unicode in Education. Adil Allawi Technical Director

AFP Support for TrueType/Open Type Fonts and Unicode

Thu Jun :48:11 Canada/Eastern

FileMaker 15 Specific Features

Survey of Language Computing in Asia 2005

Nafees Nastaleeq v1.02 beta

Kannada 2. L2/ Representation of Jihvamuliya and Upadhmaniya in Kannada Srinidhi

Bengali Script: Formation of the Reph and Yaphala, and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER

Unicode and Standardized Notation. Anthony Aristar

PROPOSALS FOR MALAYALAM AND TAMIL SCRIPTS ROOT ZONE LABEL GENERATION RULES

Proposal to encode productive Arabic-script modifier marks

Font Features for Lateef

Title: Graphic representation of the Roadmap to the BMP of the UCS

MONTHLY TEST MAY 2017 QUESTION BANK FOR AVERAGE STUDENTS. Q.2 What is free software? How is it different from Open Source Software?

The Unicode Standard Version 6.0 Core Specification

GONDI and GUNJALA GONDI CHARACTER NAMES Vowels EE and OO. Comment on GONDI (L2/15-005) and GUNJALA GONDI (L2/ ) proposals

Proposed Draft: Unicode Technical Report #53 UNICODE ARABIC MARK ORDERING ALGORITHM

ISO/IEC TR Information technology. An operational model for characters and glyphs. Version: 20 July, 1998

DESIGNING A DIGITAL LIBRARY WITH BENGALI LANGUAGE S UPPORT USING UNICODE

Title: Graphic representation of the Roadmap to the BMP, Plane 0 of the UCS

The Unicode Standard Version 6.1 Core Specification

PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC / UNICODE

ISO/IEC JTC 1/SC 2/WG 2 PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC / UNICODE

L2/ Unicode A genda Agenda for for Bangla Bangla Bidyut Baran Chaudhuri

This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley.

Nastaleeq: A challenge accepted by Omega

The Unicode Standard. Version 3.0. The Unicode Consortium ADDISON-WESLEY. An Imprint of Addison Wesley Longman, Inc.

Standardizing the order of Arabic combining marks

Proposal to Encode the Ganda Currency Mark for Bengali in the BMP of the UCS

Deedahwar - Aligarh Urdu Club -

E- MĀDHAVANIDĀNAM - User Manual

ISO International Organization for Standardization Organisation Internationale de Normalisation

Proposal to encode Devanagari Sign High Spacing Dot

Others Symbols, Additional characters proposed to Unicode. Azzeddine Lazrek

ISO/IEC JTC1/SC2/WG2 N3734 L2/09-144R

ZWJ/ZWNJ. behavior under Indic scripts with special reference to chillu, conjuncts, etc in Malayalam. Rajeev J Sebastian Rachana Akshara Vedi

The Unicode Standard Version 6.1 Core Specification

DOWNLOAD OR READ : URDU HINDI DICTIONARY IN DEVNAGRI SCRIPT PDF EBOOK EPUB MOBI

Quarterly Programmatic Report (November, 2013 to February, 2014)

International Cataloging: Use Non-Latin Scripts

L2/09-358R Introduction. Recommendation. Background. Rub El Hizb Symbol. For discussion at UTC and by experts. No action is requested.

The Unicode Standard Version 11.0 Core Specification

The Unicode Standard Version 11.0 Core Specification

IDN Visual Security Deep Thinking. xisigr Feb,2019

SaudiNIC s Proposed Solution. MENOG 8, Khobar, May 14-18, 2011

The Unicode Standard Version 10.0 Core Specification

The Unicode Standard Version 10.0 Core Specification

Character Set Supported by Mehr Nastaliq Web beta version

General Structure 2. Chapter Architectural Context

A. Administrative. B. Technical General L2/ DATE:

Using non-latin alphabets in Blaise

The Unicode Standard Version 12.0 Core Specification

Chapter 3 DATA REPRESENTATION

SOUTH ASIA Indic 1. Tamil Documents: L2/ Naming Tamil Symbols in SMP Vallinam Characters Ganesan

Proposed Update Unicode Standard Annex #34

Multimedia Data. Multimedia Data. Text Vector Graphics 3-D Vector Graphics. Raster Graphics Digital Image Voxel. Audio Digital Video

The Unicode Standard Version 6.0 Core Specification

Feedback on Draft Devanagari Script Behaviour for Hindi Ver 1.4.9

This PDF file is an excerpt from The Unicode Standard, Version 4.0, issued by the Unicode Consortium and published by Addison-Wesley.

ISO/IEC JTC 1/SC 2 N 3426

Information technology Keyboard layouts for text and office systems. Part 9: Multi-lingual, multiscript keyboard layouts

Multilingual mathematical e-document processing

Character Properties 4

Nafees Nastaleeq v1.01 beta

ISO/IEC JTC/SC2/WG Universal Multiple Octet Coded Character Set (UCS)

Instructions On How To Use Microsoft Word 2007 Pdf In Bengali

Request for encoding GRANTHA LENGTH MARK

****This proposal has not been submitted**** ***This document is displayed for initial feedback only*** ***This proposal is currently incomplete***

The Unicode Standard Version 7.0 Core Specification

Building Apps Last updated: 12 June 2017

Unicode definition list

L2/11-033R 1 Introduction

Transcription:

Blending Content for South Asian Language Pedagogy Part 2: South Asian Languages on the Internet A. Sean Pue South Asia Language Resource Center Pre-SASLI Workshop 6/7/09 1

Objectives To understand how South Asian languages work on the internet. To understand what Unicode is. To understand how to create webpages in South Asian languages. 2

What is a digital text? Any text that you see on a computer on a webpage, in a word processor, and so on is treated by the computer program as a series of numbers. 3

Text as Numbers For example, the word Text is a sequence consisting of the following numbers: T: 84, e: 101, x: 120, t: 116. The number a character refers to is called the character s codepoint. The codepoint of a character is usually represented using hexadecimal (base 16) notation instead of decimal (base 10). 4

Numbers in hexadecimal Hexadecimal (base 16) : uses the numbers 0-9 plus the letters A-F [coined in the 1960s to replace sexadecimal] Decimal 5 = Hexadecimal 5 Decimal 9 = Hexadecimal 9 Decimal 10 = Hexadecimal A Decimal 11 = Hexadecimal B Decimal 12 = Hexadecimal C Decimal 15 = Hexadecimal F Decimal 16 = Hexadecimal 10 Why is it used? It s easy to read and computers like the number sixteen (it s 2 x 2 x 2 x 2 after all!). 5

Text in Hexadecimal T= 0054 e = 0065 x = 0078 T = 0074 Note that the codepoints are normally 4 characters long 6

What is Character Encoding? Remember all characters are treated as numbers. Character Encoding refers to the underlying standard or code that governs which character will be represented as what number. An example is the Indian Script Code for Information Interchange (ISCII), which provides a character encoding for many South Asia languages. The most frequently used and versatile contemporary standard for character encoding is Unicode. 7

What is Unicode? Unicode is an evolving industry standard for character encoding that maps the characters of nearly all languages and scripts as a specific number. This unique number provides a universal reference for storing, sorting, and manipulating texts. Unicode is the current cross-platform standard for character encoding. All contemporary operating systems (i.e. OS X, Windows, Linux) and most contemporary computer programs process text following this standard by default. 8

What is in Unicode? With Unicode, there is no limit to how many characters can be added to the standard. There are currently over 100,000 characters mapped in Unicode from over thirty writing systems. Unicode codepoints are usually referred to in their hexadecimal form, often preceded by U+ as in U +200D. Each codepoint also has a character name. 9

Activity Unicode is divided based on script and not language. Explore the Unicode website and find the chart that describes your language s script: http://www.unicode.org/charts/ 10

Unicode is a standard Unicode is only a standard for what number refers to what character. The actual display of characters involves a font. 11

What is a font? A computer font is a data file containing rules for how to display certain characters in a particular typeface. These characters are referred to by numbers. 12

Types of Fonts There are a number of different font formats: OpenType (Microsoft + Adobe): The most versatile, supported by the three major operating systems (Window, Linux, and OS X) Unfortunately, complex scripts using OpenType is not as complete on Apple s OS X as in Windows and Linux Apple Advanced Typography (AAT): Apple only. Good but only works on OS X TrueType: Predecessor of OpenType; can t do complex scripts 13

Fonts Before Unicode Before Unicode, most character encodings could only deal with a character map of 256 entries. To render complex scripts, specific fonts were developed to display specific languages. For a number that would normally display a letter like T for example they would put in a character or series of characters from their font. There are a lot of websites that still require you to download a font. That s why. There is an extension to the Firefox browser called Padma that can convert a number of fonts to Unicode: http://padma.mozdev.org/ 14

What is a Unicode Font? A 'Unicode font' is a font that conforms to the Unicode standard. Like any other font, it provides rules for how to display characters, which are mapped as specific numbers. In a Unicode font, these numbers follow the Unicode standard. A Unicode font may contain the rules for displaying numerous scripts, or maybe just one. For example, the Tahoma font contains rules for how to display the Latin alphabet, as well as the scripts used by Urdu and Thai. 15

Why use Unicode? If you write documents using Unicode, your text can be displayed in numerous Unicode fonts. If you send a document or put it up on the web and your readers do not have the same font you used when writing the text, they can still read it, provided they have some other Unicode font that can display those characters. 16

How do you type in Unicode? In order to type in Unicode, one needs to install or enable an appropriate keyboard. What a keyboard does is map a specific character or sequence of characters to a unicode codepoint. The latest version of most current operating systems (Windows, OS X, Linux) already contain keyboard layouts for most South Asian scripts. Moreover, there are often different layouts available for the different scripts, such as a phonetic option or one closer to the layout used in the type writer of the native script. 17

Where do I find more information? Most operating systems also have an onscreen keyboard feature, which is helpful while learning to type. The South Asia Language Resource Center lists a number of keyboard options on its fonts website. There is a lot of information online. Bring in your computers tomorrow. 18

How Does Unicode Handle South Asian Languages? As a standard, Unicode is more concerned with characters and scripts than with languages. The characters used by different South Asian languages are therefore organized by script. Most often, characters in the same script have numbers that are next to each other in a character block. 19

Where is my character? Currently (in Unicode 5.1) there are twelve Indic script character blocks in Unicode. They are: Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Limbu, Malayalam, Oriya, Sinhala, Syloti Nagri, Tamil, and Telugu. The characters used by right-to-left languages like Urdu, Pashto, and Sindhi are found on the Arabic character code chart. Tibetan has its own code chart. 20

What are the issues with Left-to- Right South Asian scripts? 1 The twelve Indic scripts and Tibetan all function in a similar way in Unicode, as the writing systems themselves are similar. All of these scripts are segmented in a such a way that one or more consonants are written as a cluster followed by a specific vowel sign, and consonants in a cluster will change their shape either assuming a halfform or a combined form before a following vowel. Unicode is concerned only with the underlying letter and not with the shapes they assume in a word that s left up to the font. 21

What are the issues with Left-to- Right South Asian scripts? 2 Because Unicode is only concerned with letters and vowels, each script is assigned only up to 128 codepoints even though there are many more possible visual forms for these letters. For left-to-right South Asian scripts, Unicode contains codepoints for independent vowels (used when a vowel is not preceded by a consonant), consonants, and dependent vowel signs (used when vowels follow consonants), as well as digits and other various signs. Common punctuation marks are found in the Devanagari range. 22

How to combine LTR letters in Unicode There are a few ways to indicate that consonants should be combined, and they all involve inserting certain characters in between the consonants. The first of these characters is actually a sign in the left-to-right scripts referred to as a virām, halant, or hasant. All of the left-to-right South Asian scripts in Unicode contain this character. 23

How to combine LTR letters in Unicode 2 Here are the combining characters: U+094D DEVANAGARI SIGN VIRAMA U+09CD BENGALI SIGN VIRAMA U+0A4D GURMUKHI SIGN VIRAMA U+0ACD GUJARATI SIGN VIRAMA U+0B4D ORIYA SIGN VIRAMA U+0BCD TAMIL SIGN VIRAMA U+0C4D TELUGU SIGN VIRAMA U+0CCD KANNADA SIGN VIRAMA U+0D4D MALAYALAM SIGN VIRAMA U+0F84 TIBETAN MARK HALANTA U+A806 SYLOTI NAGRI SIGN HASANTA 24

How to combine LTR letters in Unicode 3 When a VIRAMA Unicode character is added in between consonants, the consonants will be displayed in their conjoined form. For example, the typical way to write the Hindi Word kyā (!य ) is: U+0915 क DEVANAGARI LETTER KA U+094D DEVANAGARI SIGN VIRAMA U+092F य DEVANAGARI LETTER YA U+093E DEVANAGARI VOWEL SIGN AA =!य 25

Summary for LTR Languages In summary, the important aspect of the way Unicode deals with left-to-right South Asian scripts is that the conjunct forms are created by adding a VIRAMA character between the consonants. There is no!य character. Instead, it s क + virama + य. There are distinct codepoints for combining vowels and independent vowels, i.e. and अ. 26

What are the issues for Right-toleft languages? Right-to-left South Asian languages use scripts derived from the script used for Arabic. The special characters of South Asian languages have unique codepoints in Unicode. Unicode does not distinguish between variants of rightto-left scripts, such as nastaliq or naskh. It is only concerned with the underlying letters and signs, not with the shape in which they combine together. Therefore, there is no need to worry about initial, medial, or final forms of letters when entering text. 27

What are the issues for Right-toleft languages? 2 Example: The fake word ببب consists only of ب U+0628 ARABIC LETTER BEH (3 times) 28

RTL Ambiguities There appears to be a certain amount of ambiguity in the Arabic code chart because a number of different characters look the same. However, the mapping of South Asian right-to-left languages is now more or less standardized in practice. One reason is the exclusion from popular fonts and keyboards of certain codepoints deemed inappropriate for South Asian languages as they are more properly part of Arabic. 29

RTL South Asia Standards While the letter ARABIC LETTER FARSI YEH (U+06CC) looks identical to ARABIC LETTER ALEF MAKSURA (U +0649), ALEF MAKSURA is often not included in current fonts. While ARABIC LETTER YEH (U+064A) looks the same as ARABIC LETTER FARSI YEH in initial and medial positions, it is best avoided, because many fonts for South Asian languages do not include it. Similarly, ARABIC LETTER HEH DOACHASHMEE (U+006BE) should be used in place of ARABIC LETTER HEH (U +0647), even though they look identical. For the regular heh, the preferred codepoint is ARABIC LETTER HEH GOAL (U+06C1). 30

RTL Standards 2 The Center for Research in Urdu Language Processing (http://crulp.org), which is leading the development of fonts and keyboard for Pakistani languages on a whole, has managed to curtail the use of numerous Arabic characters by not including them in their fonts or eliminating them from their keyboards. As a result, Urdu and other Pakistani languages on the web are more standard than Persian, for example. 31

RTL Issue: Hamza The Unicode codepoint used for the seated hamza is ARABIC LETTER YEH WITH HAMZA ABOVE (U+0626). It will look a yeh with a hamza above it on the keyboard. The ARABIC LETTER HAMZA (U+0621) is used in the standalone, unconnected position. There is a third codepoint, ARABIC HAMZA ABOVE (+0654), which renders a hamza above a letter, such as BAREE YEH, VAO, and so on. 32

How to put Unicode on a webpage? For the purposes of material developed online for SASLI, you can just type in Unicode and it will be saved appropriately on the webpage. The reason is that the webpage is already setup for what is called the utf-8 character set. As long as the header of the HTML or XHTML file shows that it will be encoded in the unicode character set, you are good to go. (Activity: Show how this done and demonstrate) 33