Crossing the Digital Divide: computer resources to aid minorities

Martin Hosken i and Melinda Lyons ii

Introduction

The term "digital divide" has been used frequently over the past five years to express the idea that certain people groups have less access to computing resources than others. Some aspects of the problem have been well explored in many documents: the financial barriers to computer access, the lack of telecommunications infrastructure in remote areas, and the lack of basic literacy that would lead to computer use. These issues affect hundreds of millions of people around the world.

A lesser-known problem is the lack of program and script support for minority groups, especially where those groups use non-Roman scripts. Over the past fifty years, computer resources became available to many of the majority peoples of Asia, especially as their economies became important to the world economic system. Large corporations developed programs and script support for the major languages. However, linguistic minority groups have often been passed over in this important area. In some cases the specialized scripts of these minority peoples are not yet available on most computers. In other cases the representation of a minority language using the national script cannot be fully realized with currently available software.

This paper will outline the historical bases for this part of the digital divide and present some examples of the current situation. It will go further to outline the major components of a solution to the problem. UNESCO and SIL have joined together to address some major aspects of the solution, and these are discussed in the final section.

What is the Digital Divide?

The digital divide is the disparity in access and use iii of various information and communications technologies. Many developed countries have noted the difficulty of providing computing and telecommunication resources to their poorest, least literate, or older populations. Those in rural areas are also less likely to use these new technologies. In some instances language is a factor related to computer or internet use, but in most cases the languages referred to are ones for which computer technologies are already available. Even so, few documents on the digital divide from either the developed or developing world include language access as a factor.

There is yet another and deeper divide in the ability to use information resources and technologies: that faced by those whose languages cannot currently be represented by available computing technologies. Millions of people speak and read languages which cannot be adequately or easily represented on computers. For these people, the problem is not one of access to a telephone or computer, but of programs and fonts that would enable them to use the new technology. Until minority peoples can work on computers like everyone else, the divide will not be overcome.

The Problem

The earliest computers were used exclusively for mathematical calculation. As computer technology developed, the earliest users required only the basic characters used to write European languages: the English alphabet plus some additional letters with accents or other marks, punctuation marks, and some additional characters such as currency symbols. At first, only specialist technicians had access to the new technology for writing and printing.
As computer technology became available to people in Eastern Europe and Asia, new problems arose in providing the characters for input and output in new languages. Technicians would handle the computer while general users were content with the printed documents that resulted. Data could not easily be shared between computers or systems, even as English-speaking users were creating complex databases. It was not until the 1990s that general-purpose Chinese-language database programs became available that could handle and share data readily.

Early Solutions

Few people in the 1980s were interested in developing the ability to represent minority languages on computer. Most who did were using Roman (European) letters, using the computer to produce printed literature for the minorities. A very small group of academics wanted to study minority Asian or African languages with unusual scripts. Often these people would request help from a technically savvy friend, who would enable simple input and output of the minority language. If unusual letter shapes were needed, a rough font would be produced for display or printing.

During the eighties and early nineties such individual solutions were common, and it was cumbersome to produce multilingual documents. The letters for each script were on separate codepages in the Microsoft system, and each section of text needed to indicate the source of its letters. Combining text from different scripts on one page required a great deal of work, or was impossible. As large corporations expanded their operations into Asia, more users needed to combine Western and Asian scripts, and computer companies turned their attention to solving some of the problems. This enabled large sections of the population to use computers readily, but it also created additional problems for minority groups using the national script.

Computer Issues in Asian Language Development

The history of language and literacy spread in Asia has created some unique problems for computer users in the region. Reading and writing spread from the Indian subcontinent across to Southeast Asia, with each new people group slightly modifying the letter shapes over time. The current result is that there are fifteen to twenty somewhat similar scripts in use across Southeast Asia. Four of these are national scripts, but many of the larger minority groups still use their own writing system for their own language. Most of these minority groups do not have adequate computer resources (see the website of the Script Encoding Initiative iv for further details).

The more recent history of language development in Asia has created additional problems for those providing computing resources. Current plans and programs use the national language script as the basis for minority language literacy programs. Often the national language is from a different language family than the minority language, and its symbol system does not readily represent all the sounds of the minority language. In adapting the letters of the national language to represent the minority language, a language worker may use the symbols in a way which cannot be represented using current computer programs.

An example of the latter is seen when the Thai script is used to represent Katuic languages of Eastern Thailand. The Thai language has fifteen vowel glyphs used to represent the twenty-one distinctive vowels of Thai, either singly or in specific combinations. The Bru language has a more extensive set of contrastive vowels, about forty or so. One proposed orthography uses the Pali dot as an additional vowel symbol to aid in distinguishing these vowels. v However, the implementation of Thai in current software does not preserve this symbol where it is needed: it is considered a mistake and is eliminated.

Why is a General Solution Important?

There are several reasons for solving the general problem of providing computer resources for all languages in a freely available way. First, it will empower local people, enabling any and all to do local language development work.
There will then be no distinction between people groups based on available and unavailable resources. Second, it will reduce the need for special technical support for those doing language development from within or outside the minority group; as a large organization involved with language development, SIL has found technical support to be one of its great challenges. Finally, computer literacy and work skills learned in one environment will transfer readily across language boundaries: minority people with computer skills will be able to work in the national or neighboring languages without any special or additional training.

Another reason a general solution is needed is illustrated by the situation mentioned above: the technology developed must allow adaptation by minorities who have different language needs than those of the national majority. The glyphs should permit combinations in any order needed to represent the sounds of the minority language.

One last feature of a general solution is that it would permit data storage and retrieval in any fashion, whether ad hoc or organized. This will facilitate database and other information technology uses, perhaps allowing a minority user to store information on prices for local commodities over time, further improving local life. A worker in a local health facility could track births and deaths or other health information over time. But this is only possible when the local language can be used with other applications and programs.

Approach to the Problem

The solution to the problem of computer access for language development requires several components. The first and largest piece is a common storage system encompassing all languages. The second and third parts are the ability to enter data into the system and to retrieve it; commonly these are keyboarding (a method of entry) and rendering (the process of converting stored information to be viewed on the screen or printed on paper). The last two components of the larger problem are the ability to analyze data and to convert data from one form to another. The focus of this discussion will be on the first three.

Encoding

Encoding is the means by which information is stored in the computer, using a particular binary sequence for each character. As computing in different languages developed, each language used one codepage for all the characters and punctuation of the language. The computer system would tie input and output to the 256 characters of a particular codepage.

The difficulties of using codepages to encode various languages came to international attention in the 1980s, as corporations were attempting to produce versions of their products for different countries. Users wanted to work in more than one or two languages at a time for business interactions, but found themselves hindered. In 1988 a committee was formed to discuss this problem of multiple encodings. Four design principles were agreed upon for the new encoding, which they called Unicode:

1) It should be a universal standard covering all writing systems.
2) It should avoid special coding sequences or states which might vary between systems.
3) It should provide a uniform coding width for each character.
4) The same value should always represent the same character.

Following that meeting, Unicode was created in 1991. The Unicode encoding system has space for all writing systems, although some have not been added yet. There are several minority languages of Asia yet to be added to the Unicode scheme, such as Cham, Khamti, Kayah Li, Lanna, and Pahawh Hmong. Proposals have been submitted to the Unicode Technical Committee for some of these languages, but work still needs to be done to finalize their encodings. The Script Encoding Initiative of the University of California at Berkeley has been set up to address this need, and welcomes input from researchers directly involved with the languages. The Script Encoding Initiative states the case for incorporating minority scripts: "For a minority language, having its script included in the universal character set will help to promote native-language education, universal literacy, cultural preservation, and remove the linguistic barriers to participation in the technological advancements of computing." vi

Several design principles incorporated into Unicode are important for this discussion. First, one codepoint represents one character (or letter). Accents or letters that are placed over or under other letters are separate codepoints, and rules determine how they are combined. When two or three symbols are combined for a single sound (such as เ อ in Thai), these are each separate codepoints. Second, Unicode stores characters in logical order, the order in which the sounds occur in the word. For languages which use right-to-left writing systems, this means that Unicode stores the right-most character first, unlike in European written order. Third, Unicode stores plain text: no formatting information is part of Unicode.
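To make the first two of these principles concrete, the following minimal sketch (Python, standard library only; the sample strings are our own illustrative choices, not taken from any standard) shows a character stored either as one precomposed codepoint or as a base letter plus a combining mark, and a multi-part vowel stored as one codepoint per written symbol:

```python
import unicodedata

# "é" can be stored as a single precomposed codepoint or as a base
# letter plus a combining mark; Unicode normalization relates the two.
composed = "\u00E9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" + COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# A multi-part vowel such as the Thai เ...อ is likewise stored as one
# codepoint per written symbol:
for ch in "\u0E40\u0E2D":  # SARA E, O ANG
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```
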
Input Methods

The second component of a general solution for minority computing is data input. Currently, the usual method of data input is a keyboard. Each key on the keyboard is tied to a particular codepoint using specialized software. The difficulties lie in assigning particular keystrokes to their codepoints, and in the cases where characters need to be combined.

When developing a keying system, two main approaches are possible. A mnemonic system uses information printed on the keycaps to indicate what the key will type: for an English speaker typing Thai, the L key might give the lower case ล, and the upper case might give the less common ฬ. A positional keyboard instead considers the relative positions of keys on the keyboard, often placing the frequent letters on lower case central keys while less frequent letters go on upper case or peripheral keys. For very large character sets, typing a single character or a pair of characters can open a candidate window, from which the correct character is selected.
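As a rough sketch of how a keying system ties keystrokes to codepoints, the following Python fragment implements a tiny mnemonic mapping together with the dead-key technique taken up in the next paragraphs. All key assignments and sequences here are invented for illustration; nothing corresponds to a real layout or to Keyman's behavior.

```python
# A tiny, invented mnemonic layout: each keystroke maps directly to a
# Thai codepoint, with Shift selecting the rarer letter on the same key.
MNEMONIC = {
    ("l", False): "\u0E25",   # THAI CHARACTER LO LING (common)
    ("l", True):  "\u0E2C",   # THAI CHARACTER LO CHULA (rare)
}

# Invented dead-key sequences: the first key displays nothing until the
# second key resolves the combination.
DEAD_KEYS = {("'", "e"): "\u00E9", ("'", "a"): "\u00E1"}

def type_keys(events):
    """events: iterable of (key, shifted) pairs; returns the typed text."""
    out, pending = [], None
    for key, shifted in events:
        if pending is not None:                      # resolve a dead key
            out.append(DEAD_KEYS.get((pending, key), pending + key))
            pending = None
        elif key == "'":                             # dead key: buffer, show nothing
            pending = key
        else:
            out.append(MNEMONIC.get((key, shifted), key))
    if pending is not None:                          # trailing unresolved dead key
        out.append(pending)
    return "".join(out)

print(type_keys([("l", False), ("l", True)]))        # ลฬ
print(type_keys([("'", False), ("e", False)]))       # é
```

Note how the buffered `pending` state is invisible to the typist, which is exactly the source of the confusion, and of the deletion ambiguity, described below.
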

For those needing more than a hundred or so different combinations, but not several thousand, modifier keys or dead keys can be used to provide additional combinations. Modifier keys are keys such as Ctrl, Alt, or the Ctrl+Alt combination. Often, however, these are used by application programs, and their use for typing characters can confuse the application or cause unexpected results. Dead keys are keys which display no character until a second character in the sequence is typed. This allows rapid input, but can cause confusion if one does not remember what has already been entered.

One solution to the problems with dead keys is to use the modifier key after the main character, with each result displaying on the screen so the typist knows what has been entered. This works well if each character or state can be displayed by the font, but not all characters can be displayed by all fonts; when an intermediate state is reached that cannot be displayed, it can cause confusion. Another problem with each of these entry systems is how to delete specific characters represented by a Unicode string. When the modifier key system is used, each state may be deleted separately. However, when a dead key is used, there is no way to know how many keystrokes must be deleted to get to the state needed for further entry.

There is one input program, called Keyman, that handles complex script input well. It works for those using Microsoft products, and handles input using modifier keys. A second program for open-source applications, called SCIM, is currently under development; it should work equally well for those using a Linux operating system environment.

Output Issues

After solving the problems of input and storage, data must generally be displayed on a computer screen or printed on paper. This process requires that an encoded string of characters be converted to a glyph string. Glyphs are the visual representations of the underlying characters, and the process of converting encoded data to a visual presentation is called rendering. The basic process involves two steps: 1) identifying the glyphs which correspond to a Unicode string, and 2) rendering (printing or displaying) the glyphs on the paper or screen at the proper locations.

Some problems of the early rendering systems have been solved by the newer smart font technologies. Early technologies required the rendering system to use one codepoint for one glyph in one position, sometimes resulting in four different codes needing to be stored for a single tone mark, one for each position where it could be rendered. Smart fonts take a two-stage processing approach: the first stage is glyph substitution and the second is glyph positioning. Glyph substitution ensures that the correct glyph is selected in the right context (when there are specialized glyph forms for different positions in a word, for instance). Glyph positioning fits the glyph in the right place relative to the baseline and other glyphs. In smart-font rendering these two stages are added to the basic process, giving a four-step process: 1) convert the Unicode string to glyphs, 2) process the glyph string into a new glyph string, 3) position the glyphs relative to each other, and 4) render the glyphs on paper or screen. For more information on rendering technologies, see Hosken 2001b below.
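The following toy sketch in Python walks through the four steps in order. The glyph names, the single substitution rule, and the metrics are all invented for illustration; real smart-font systems such as OpenType or Graphite drive these steps from tables compiled into the font itself.

```python
BASE_ADVANCE = 600  # hypothetical advance width of a base glyph

def to_glyphs(text):
    """Step 1: map each codepoint to a default glyph name."""
    return [f"uni{ord(c):04X}" for c in text]

def substitute(glyphs):
    """Step 2: contextual substitution. Invented rule: the tone mark
    MAI EK over the tall consonant PO PLA swaps to a shifted variant."""
    out = []
    for g in glyphs:
        if g == "uni0E48" and out and out[-1] == "uni0E1B":
            g = "uni0E48.shifted"
        out.append(g)
    return out

def position(glyphs):
    """Step 3: give each glyph an (x, y) offset; marks get no advance
    and are raised above the preceding base glyph."""
    x, placed = 0, []
    for g in glyphs:
        if g.startswith("uni0E48"):          # a mark: zero advance, raised
            placed.append((g, x - BASE_ADVANCE, 700))
        else:                                # a base glyph on the baseline
            placed.append((g, x, 0))
            x += BASE_ADVANCE
    return placed

def render(placed):
    """Step 4: hand positioned glyphs to the rasterizer (here, print)."""
    for g, x, y in placed:
        print(f"{g} at ({x}, {y})")

# PO PLA + MAI EK + SARA AA, stored in logical order:
render(position(substitute(to_glyphs("\u0E1B\u0E48\u0E32"))))
```
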
Graphite

A new rendering system called Graphite has been developed by SIL because of the problems and limitations of the previous systems. Graphite uses smart font rendering to enable the use of any glyph in any position needed for a minority script. It gives the font developer complete control over all aspects of script rendering, so those working with minority scripts can develop fonts and use them independently of others. Because it can be used for any language, the skills developed are readily transferred to another language or script. For more information about the reasons Graphite was developed, see Hosken 2002, "Rendering Thai as an example of using Graphite for generalized smart font rendering."

Graphite technology is being incorporated into various products for word processing and Internet access. There is already a simple text editor called WorldPad which uses the Graphite rendering system, and development is in progress on a Graphite-enabled Mozilla web browser and a Graphite-enabled office suite, OpenOffice. Each of these tools will further enable minority peoples to use computing resources and will promote language development.

Conclusion

The digital divide is a real barrier to those in the developing world who cannot use computing resources because no provision has been made for their unique scripts. This paper has outlined many reasons why the current technologies are inadequate to the task of providing computing resources to small minority groups in Asia. The new Graphite technology has the potential to provide an answer to this problem and to allow many more people the chance to further develop their languages and communities.

Bibliography

Department of Linguistics, U C Berkeley. 2003. Script Encoding Initiative: Alphabetical script list. http://www.linguistics.berkeley.edu/~dwanders/alpha-script-list.html

Department of Linguistics, U C Berkeley. 2003. Script Encoding Initiative: Home page. http://www.linguistics.berkeley.edu/~dwanders/#list

Digital divide. 2002. National Office for the Information Economy, Commonwealth of Australia. http://www.noie.gov.au/projects/access/connecting_communities/digitaldivide.htm

Green, Julie, and Feikje van der Haak. 2002. The Bru people in Khong Chiem, Ubon Ratchathani. In Minority language orthography in Thailand: five case studies. Bangkok, Thailand: TU-SIL-LRDP Committee, Thammasat University.

Hosken, Martin. 2001a. An introduction to keyboard design theory: what goes where? In Implementing writing systems: an introduction, edited by Melinda Lyons. Dallas, Texas: SIL Non-Roman Script Initiative, 121-137.

Hosken, Martin. 2001b. An introduction to smart font rendering: glyph tracing for beginners. In Implementing writing systems: an introduction, edited by Melinda Lyons. Dallas, Texas: SIL Non-Roman Script Initiative, 139-151.

Hosken, Martin. 2002. Rendering Thai as an example of using Graphite for generalized smart font rendering. SNLP-Oriental COCOSDA 2002.

Lyons, Melinda. 2001. Graphite and WorldPad: tools for writing the world's other languages. Techknowlogia, November-December 2001, 51-54.

Notes

i SIL International and Payap University.
ii SIL International.
iii Digital divide. 2002. National Office for the Information Economy, Commonwealth of Australia.
iv Department of Linguistics, U C Berkeley. 2003. Script Encoding Initiative: Alphabetical script list.
v Green and van der Haak 2002.
vi Department of Linguistics, U C Berkeley. 2003. Script Encoding Initiative: Home page.