Can R Speak Your Language?

Similar documents
Representing Characters, Strings and Text

UTF and Turkish. İstinye University. Representing Text

Representing Characters and Text

2011 Martin v. Löwis. Data-centric XML. Character Sets

2007 Martin v. Löwis. Data-centric XML. Character Sets

Google Search Appliance

The process of preparing an application to support more than one language and data format is called internationalization. Localization is the process

SAPGUI for Windows - I18N User s Guide

IT 1204 Section 2.0. Data Representation and Arithmetic. 2009, University of Colombo School of Computing 1

Part III: Survey of Internet technologies

Lecture 25: Internationalization. UI Hall of Fame or Shame? Today s Topics. Internationalization Design challenges Implementation techniques

COM Text User Manual

Tex with Unicode Characters

umber Systems bit nibble byte word binary decimal

Using non-latin alphabets in Blaise

Attacking Internationalized Software

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Picsel epage. PowerPoint file format support

Transfer Manual Norman Endpoint Protection Transfer to Avast Business Antivirus Pro Plus

Attacking Internationalized Software

10 Steps to Document Translation Success

Pan-Unicode Fonts. Text Layout Summit 2007 Glasgow, July 4-6. Ben Laenen, DejaVu Fonts

Chapter 4: Computer Codes. In this chapter you will learn about:

COSC 243 (Computer Architecture)

Thomas Wolff

Displaying Chinese Characters In Blaise

Chapter 11 : Computer Science. Information Representation. Class XI ( As per CBSE Board) New Syllabus

Compatibility matrix: HP Service Manager Software version 7.00

Transfer Manual Norman Endpoint Protection Transfer to Avast Business Antivirus Pro Plus

Global Deployment Guide. Version 7.8, Rev. A November 2005

Picsel epage. Word file format support

Multilingual vi Clones: Past, Now and the Future

Beyond Base 10: Non-decimal Based Number Systems

CS144: Content Encoding

2. bizhub Remote Access Function Support List

Digital Logic. The Binary System is a way of writing numbers using only the digits 0 and 1. This is the method used by the (digital) computer.

The Use of Unicode in MARC 21 Records. What is MARC?

Intermediate Programming & Design (C++) Notation

*North American Standard USB Keyboard

Extended Character Sets for UCAS Systems

Easy-to-see Distinguishable and recognizable with legibility. User-friendly Eye friendly with beauty and grace.


Chapter 7. Representing Information Digitally

Version 5.5. Multi-language Projects. Citect Pty Ltd 3 Fitzsimmons Lane Gordon NSW 2072 Australia

Our Hall of Fame or Shame candidate for the day is this interface for choosing how a list of database records should be sorted.

D16 Code sets, NLS and character conversion vs. DB2

PLATYPUS FUNCTIONAL REQUIREMENTS V. 2.02

Beyond Base 10: Non-decimal Based Number Systems

An Introduction to the Internationalisation of econtent

Friendly Fonts for your Design

Licensed Program Specifications

The Unicode Standard Version 11.0 Core Specification

Push button sensor 3 Plus - Brief instructions for loading additional display languages Order-No , , 2042 xx, 2043 xx, 2046 xx

Administrative Notes February 9, 2017

Wordman s Production Corner

Casabac Unicode Support

III-16Text Encodings. Chapter III-16

REPRESENTING INFORMATION:

HP Enterprise Collaboration

MetaMoJi Share for Business Ver. 3 MetaMoJi Note for Business Ver. 3 Administrator s Guide

Manual Rosetta Stone Portuguese Mac Italian

Localizing Intellicus. Version: 7.3

Reviewed by Tyler M. Heston, University of Hawai i at Mānoa

ICANN IDN TLD Variant Issues Project. Presentation to the Unicode Technical Committee Andrew Sullivan (consultant)

Course Schedule. CS 221 Computer Architecture. Week 3: Plan. I. Hexadecimals and Character Representations. Hexadecimal Representation

uptex Unicode version of ptex with CJK extensions

CUPS Translation Guide CUPS TRANS 1.1. Easy Software Products Copyright , All Rights Reserved

Japanese utf 8 font. Japanese utf 8 font.zip

IBM Full Width Quiet Touch Keyboards now available on IBM System p servers and selected IntelliStation workstations

IBM DB2 Web Query for System i V1R1M0 and V1R1M1 Install Instructions (updated 08/21/2009)

IPDS Emulation. Selects the Data Stream parameter to the following values: - IPDS (default) - ASCII LAN INTERFACE PARAMETERS

Release Notes. for Kerio Operator 1.2.1

LBSC 690: Information Technology Lecture 05 Structured data and databases

CMPS 10 Introduction to Computer Science Lecture Notes

Collations in MySQL 8.0

Character Encodings. Fabian M. Suchanek

PROC SORT (then and) NOW

ES01-KA

UNITED STATES GOVERNMENT Memorandum LIBRARY OF CONGRESS. Some of the proposals below (F., P., Q., and R.) were not in the original proposal.

Computer Organization and Assembly Language. Lab Session 4

EMu Documentation. Unicode in EMu 5.0. Document Version 1. EMu 5.0

Conversion of Cyrillic script to Score with SipXML2Score Author: Jan de Kloe Version: 2.00 Date: June 28 th, 2003, last updated January 24, 2007

MetaMoJi Share for Business Ver. 2 MetaMoJi Note for Business Ver. 2 Installation and Operation Guide

WebTransactions V7.1. Supplement

MULTINATIONALIZATION FOR GLOBAL LIMS DEPLOYMENT LABVANTAGE Solutions, Inc. All Rights Reserved.

Multimedia and Web Design (MWD) Skill Area 324: Develop Multimedia Application

The Role of the Local Alphabet in Enhancing of the Development of Digital Communication and the Associated Problems

Note 8. Internationalization

IBM Infoprint Fonts for Multiplatforms V1.1 Extends Global AFP Font Support

TECkit version 2.0 A Text Encoding Conversion toolkit

Network Working Group. Category: Informational July 1995

St. Benedict s High School. Computing Science. Software Design & Development. (Part 2 Computer Architecture) National 5

PROC SORT (then and) NOW

System Demonstration TRADOS TRANSLATOR'S WORKBENCH


Review PDF Converter for Windows software downloader cnet ]

Customer Release Notes - Release Build

Book Size Minimum Page Count Maximum Page Count 5x8 B&W x9 B&W x11 B&W x8.5 Color x11.

Prior Art Database Keyword Search Guide 1

KYOCERA Quick Scan v1.0

Transcription:

Languages Can R Speak Your Language? Brian D. Ripley Professor of Applied Statistics University of Oxford ripley@stats.ox.ac.uk http://www.stats.ox.ac.uk/ ripley The lingua franca of computing is (American) English. R uses English for its keywords and its (several thousand) built-in functions. However, the data used in computing can be in any human language (or none). One issue is to represent that data: we will discuss character sets (aka charsets) shortly. Another is to display the data in a human-readable form, both on the console and on graphics devices. Part of the data is information related to the packages themselves. This includes the manuals, help files and so on. The terser the information, the harder it is for non-native speakers to comprehend: this applies especially to menus and information/warning/error messages. Internationalization (aka i18n) is the process of enabling support for data and messages in different languages. Localization (aka i10n) is the process of adding language-specific features, such as translations of messages. Character Sets Long ago encodings such as EBCDIC and ASCII were developed to represent characters in decimal and binary computers. The A abbreviates American, and these only handled upper- and lower-case letters A-Z, digits, punctuation symbols, American currency ($ but not cent), and some odds such as # % * + / \ ^ _ ~. Characters vs glyphs A character is an abstract concept, and a glyph is a visual representation of it. Did ASCII mean minus or hyphen by -? It matters both in interpretation and display in proportionally-spaced fonts. What is? An apostrophe, a quotation mark, an accent? What is? A quotation mark or an accent? What is ~? A tilde accent or more like? Character Sets II People writing non-american English (even UK English) need other characters sets. Initially there were many proprietary extensions (DEC, HP, MacRoman, Windows), but a fairly small number of character sets became common. These represent up to 255 characters, using a single 8-bit byte for each character. Examples include ISO 8859-1 (aka Latin 1) and KOI8-R. Single-byte encodings cover most of the world s languages. (Vietnamese uses perhaps the most characters of these.) The exceptions are Chinese, Japanese, Korean (CJK) and languages with diacritical marks like Thai. CJK languages have tens of thousands of characters (and more glyphs). A number of shift coding systems were developed, such as EUC-JP and the Windows code pages CP932, 936, 949 and 950. Note that these character sets only allow a small set of languages at a time (in addition to English), not for example German and Polish.

Unicode The grand vision was to have a character set to represent all human languages at once that would be used everywhere. The nearest we have in Unicode. This has 2 21 slots for characters, and most have now been filled with human languages and other forms of notation. That leaves the question of how those 2 21 slots are represented as data. Two main ways have emerged. UTF-8 Each character is represented by one to four bytes, with ASCII using one. [Used by most Unix-alikes, but only recently.] UCS-2 Most characters are represented by two bytes, those with code over 2 16 by four. [Used by NT-based Windows for a decade, but characters beyond 2 16 are turned off by default.] Unfortunately, OSes that support one tend not to support the other, so R has to cope with both, as well as with older OSes (e.g. Windows 98) that support neither. Internationalization of R The major task was to add support of multi-byte character sets (MBCS), which was added in R 2.1.0 in April 2005. By that time Linux systems set up to use UTF-8 were becoming common, and MacOS X is in part based on UTF-8. The task was non-trivial, as all operations on character strings had to be reviewed and if necessary converted to work with characters and not bytes. At the same time, the standard GNU gettext facilities were grafted onto R to provide for the translation of messages. Again, this was a larger task that might be supposed as the gettext system was not designed for an extensible system, and also for a Unix-alike rather than Windows. Least trivial of all, someone had to mark up messages for translation, and the opportunity was taken to rationalize and improve the error messages. Also at the same time, iconv was added to allow translation between character sets (as seamlessly as possible) as well as mark-up for the character encodings used in documentation. Localization of R Localized RGui Consoles Once translation facilities were in place, volunteers set up translation teams. We now have partial translations for Brazilian Portuguese, Chinese (both Simplified and Traditional orthography), English (as opposed to American), French, German, Italian, Korean, Russian and Spanish. and it is an on-going effort to keep these current. There are separate localization efforts for the Windows installer and for the MacOS X console (Dutch, French, German, Italian and Japanese). Another issue for the CJK languages is that they are displayed with wider glyphs than ASCII, so that there are no usable mono-spaced fonts. This means that consoles have had to be rewritten to cope with variable-width characters.

French Russian Japanese Inputting Text How does the user input text in a particular language? One way is to use a suitable keyboard, but even with virtual keyboards that can be tedious if several languages are involved. Sometimes it is possible to write separate documents in UTF-8 or UCS-2 and combine them in a suitable editor. Unicode provides a number (the Unicode point) for all known characters. This is usually written in hex, e.g. U+201C (left double quotation mark). R 2.3.1 allows this to be entered in the form \u201c wherever the character is representable in the character set within which R is running. Characters can always be input in a particular encoding as one or more bytes by octal or hex escapes, e.g. ń is 0xF1 in Latin 2.

Internationalization of Graphics Devices Displaying non-ascii glyphs is an issue for text output, but one we can leave to the OS. For graphical output, it is an issue we have to address, especially for portable forms of graphics like PDF. R 2.1.0 provided support for the screen devices, provided suitable fonts are available. Ensuring that suitable fonts are available can be tricky, not least on X servers which do not give priority to finding the right encoding (essentially because there are very few Unicodeencoded fonts to choose from). R 2.3.0 provided more support: e.g. on NT-based Windows Unicode output is used which enables you to output graphics in a different locale from that R is running in, and also to run R is a locale different from that R is running in. PostScript and PDF devices PostScript and PDF provided a considerable headache, one that Ei-ji Nakama, Paul Murrell and I worked on for a long time. The relatively easy part was to provide support for 8-bit encodings, and we have that for almost all Western and Eastern European languages, including Greek and Cyrillic (but only for one language group at a time). This can be extended by adding suitable coding files. However, Adobe does not make use of Unicode, and the CJK languages are displayed in so-called CID fonts with private encodings. R 2.3.0 added support for CJK in a small range of easily-available CID fonts for postscript and pdf. There are further issues: see the article by Paul Murrell and myself in the very latest R-News. Portability Issues The i18n of R allows you to write in your native language. Should you? Probably not if your work is to be used outside a strictly controlled environment. Once you use any non-ascii character, there is no guarantee that your users will see the character you intended, or any character at all! Which object names are syntactically valid depends on the locale. It is possible to build R with no support for MBCS or i18n. It is not uncommon for commercial Unixen to have no support beyond Latin 1. Microsoft sells Non-international English versions of Windows that have no support for changing language (it even sells them in Europe). And it is possible to run R in the C locale, which strictly only allows ASCII characters. R enforces many of these restrictions on packages, assuming that they will be run in different environments. This is unfortunate for those who would like, e.g. to display a banner giving their authorship, but if they realize their names will most often be mangled they are likely to prefer a transliteration. We started with Does R speak your language? Conclusion

Conclusion We started with Does R speak your language? and the answer is R is capable of speaking your language. For most people it has already been taught how. If you are one of those for which R only has potential, please consider contributing the localization needed.