upTeX: Unicode version of pTeX with CJK extensions


Takuji Tanaka (upTeX project), Oct 26, 2013

Outline
(1) Introduction
(2) Unicodization: Japanese, CJK, with European languages, world languages
(3) Implementation: Unicodization, \kcatcode, set3
(4) upTeX vs. Ω, XeTeX, ...
(5) Present & future

Part I: Introduction

Introduction: pTeX/pLaTeX
ASCII pTeX/pLaTeX is great:
  High-quality Japanese typesetting, incl. vertical writing, Japanese hyphenation, ...
  The Japanese standard TeX/LaTeX
  Strong support from its environment: DVIware, packages, macros, software, books, ...
but it has weaknesses:
  Japanese-local
  8-bit Latin, Chinese and Korean are not available
  Limited character set because of legacy encodings (Shift_JIS, EUC-JP)

Introduction: Motivation
  Support a wider Japanese character set through Unicode
  Support Babel by switching between Latin and CJK tokens
  Support Chinese/Korean
  Keep the quality and environment of pTeX

Introduction: Features of upTeX/upLaTeX
(1) High-quality CJK typesetting based on pTeX/pLaTeX
(2) Compatible with pTeX/pLaTeX
(3) Unicode / UTF-8
(4) Switching between Latin (12bit) and CJK (29bit) tokens
(5) CJK with Babel (Latin/Cyrillic/Greek ...)
(6) Beyond the BMP, incl. SIP (U+2xxxx)

Part II: Unicodization

Unicodization: Strategies of Unicodization
(1) Unicodize only I/O. Ex: \usepackage[utf8]{inputenc}
(2) Implement Unicode functions. Ex: XeTeX
(3) Compromise. upTeX: internally, Unicodize only CJK; I/O: fully Unicodized

Unicodization: Partial Unicodization

              TeX       pTeX          upTeX
7bit Latin    azaz      azaz          azaz
8bit Latin    æœæœ      -             æœæœ
inputenc      гдгд      -             гдгд
Japanese      -         JIS X 0208    Unicode
CK            -         -             Unicode

pTeX and upTeX each consist of two parts:
(1) A part that is the same as original TeX
(2) A Japanese part: JIS X 0208 in pTeX, Unicode in upTeX

Japanese: New JIS
New JIS: JIS X 0213
upTeX handles the new JIS X 0213, which goes beyond JIS X 0208.
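The difference between the two character sets can be probed outside TeX. The following is a minimal Python sketch, not part of upTeX: it assumes that Python's "shift_jis" codec stands in for JIS X 0208 and "shift_jis_2004" for JIS X 0213, and the helper name can_encode is hypothetical.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Assumption: "shift_jis" approximates JIS X 0208 coverage,
# "shift_jis_2004" approximates JIS X 0213 coverage.
samples = ["\u3042", "\U00020B9F"]  # HIRAGANA LETTER A; an ideograph at U+20B9F
for ch in samples:
    print(f"U+{ord(ch):04X}",
          "0208:", can_encode(ch, "shift_jis"),
          "0213:", can_encode(ch, "shift_jis_2004"))
```

The second sample lies outside the BMP, so it is also an example of a character that needs the wider internal code discussed later in the slides.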

Japanese: Characters outside JIS
[Example: source and output for characters beyond JIS X 0213 (new JIS)]
Platform-dependent characters are now in Unicode.

CJK: Basics
Chinese/Japanese/Korean
[Example: source and output using the font switches \schrm, \tchrm, \jpnrm and \korrm]

CJK: Glyphs
Difference of glyphs among CJK
[Figure: glyph comparison across Simplified Chinese, Traditional Chinese, Japanese and Korean]

CJK: End-of-line
[Example: the sentence "Please give me beer." in several scripts, with each source line break marked either "treated as space" or "ignored"]

CJK: Control words
Control word made of CJK characters
Source:
  \def\ {%
    \number\year %
    \number\month %
    \number\day %
  }
  Today: \
Output:
  Today: 2013 10 26

CJK: Japanese-OTF package
Source:
  \usepackage[uplatex,...]{otf}
  ...
  Adobe-Korea1-1:\\
  \CIDK{8322}\CIDK{8588}...
  Adobe-Japan1-5:\\
  \ \ \ajRecycle{10}%
  \ajLig{ }%
  \ajPICT{ }\\
  \ajMaru{1}...
Output:
  Adobe-Korea1-1:
  Adobe-Japan1-5:
The Japanese-OTF package also supports CK.

CJK: Unification

              standard    full-width
Cyrillic Ж    U+0416      U+0416
Latin W       U+0057      U+FF37

Unicode has no separate full-width code points for Greek and Cyrillic.
This is a barrier to Unicodizing Japanese software.
upTeX can handle full-width Greek and Cyrillic through markup.
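The asymmetry in the table can be checked against the Unicode character database. This is a small Python sketch using the standard unicodedata module; it only inspects Unicode properties and says nothing about how upTeX itself selects widths.

```python
import unicodedata

def width_class(ch: str) -> str:
    """East Asian Width property of a character (UAX #11)."""
    return unicodedata.east_asian_width(ch)

for ch in ["W", "\uFF37", "\u0416"]:
    # Latin W, FULLWIDTH LATIN CAPITAL LETTER W, CYRILLIC CAPITAL LETTER ZHE
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), width_class(ch))

# Compatibility normalization folds the full-width Latin letter back to U+0057;
# the Cyrillic letter has no full-width twin, so it is left unchanged.
print(unicodedata.normalize("NFKC", "\uFF37") == "W")
print(unicodedata.normalize("NFKC", "\u0416") == "\u0416")
```

Latin W has a narrow and a full-width code point, while Ж has a single code point whose width is classed as ambiguous, so the choice between narrow and full-width rendering has to come from outside the character code.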

With European languages: inputenc & UTF-8
Source:
  \usepackage[utf8]{inputenc}
  \usepackage[T1]{fontenc}
  \kcatcode`ç=15
  ...
  But aren't Kafka's Schloß and Æsop's Œuvres often naïve vis-à-vis the dæmonic phœnix's official rôle in fluffy soufflés?
Output:
  But aren't Kafka's Schloß and Æsop's Œuvres often naïve vis-à-vis the dæmonic phœnix's official rôle in fluffy soufflés?

With European languages: Babel
Source:
  \usepackage[french,...]%
    {babel}
  ...
  \selectlanguage{english}
  English ... \today
  ...
  \selectlanguage{russian}
  Русский ... \today
  \selectlanguage{japanese}
  ... \today
Output:
  English    October 26, 2013
  Français   26 octobre 2013
  Deutsch    26. Oktober 2013
  Czech      26. října 2013
  Русский    26 октября 2013 г.
             2013 10 26

With European languages: It's a small world
upTeX can handle CJK, Latin, Cyrillic and Greek.
upTeX cannot directly handle Arabic, Brahmic, ...

Part III: Implementation

Implementation: Unicodization
(1) I/O: EUC/SJIS in pTeX -> UTF-8 in upTeX (ptexenc library)
(2) Internal buffer: 16bit in pTeX -> 29bit in upTeX (Ref. Omega)
(3) Unicodize standard macros and libraries
(4) upTeX support in DVIware

Implementation: DVIware
ptetex3+ / Linux
W32TeX / Windows
dvipdfmx, dvips, xdvi, dvi2tty & DVIOUT are available.

Implementation: \kcatcode

kcatcode  catcode  kind         e.g.   control word  end of line
15        10       space        -      -             -
15        11       char         azaz   yes           as space
15        12       other char   (.!?   no            as space
16        -        Kanji               yes           ignore
17        -        Kana                yes           ignore
18        -        CJK symbol          no            ignore
19        -        Hangul              yes           as space

If \kcatcode is 15, the character is treated as Latin and upTeX behaves the same as original TeX.
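The "end of line" column can be illustrated with a toy model. This Python sketch is not upTeX's algorithm: the classifier toy_kcatcode and its Unicode ranges are hypothetical simplifications chosen for illustration, and only the mapping from kcatcode to end-of-line behaviour is taken from the table.

```python
# kcatcode -> treatment of a line end that follows such a character (from the table)
EOL_RULE = {15: "space", 16: "ignore", 17: "ignore", 18: "ignore", 19: "space"}

def toy_kcatcode(ch: str) -> int:
    """Hypothetical, heavily simplified classifier (assumed ranges)."""
    cp = ord(ch)
    if 0x3000 <= cp <= 0x303F:                               # CJK symbols and punctuation
        return 18
    if 0x3040 <= cp <= 0x30FF:                               # Hiragana and Katakana
        return 17
    if 0x4E00 <= cp <= 0x9FFF or 0x20000 <= cp <= 0x2FFFF:   # ideographs
        return 16
    if 0xAC00 <= cp <= 0xD7A3:                               # Hangul syllables
        return 19
    return 15                                                # everything else: Latin

def join_lines(lines):
    """Join source lines, inserting a space only where the rule says 'space'."""
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        is_last = i == len(lines) - 1
        if not is_last and line and EOL_RULE[toy_kcatcode(line[-1])] == "space":
            out.append(" ")
    return "".join(out)

print(join_lines(["Please give", "me beer."]))
print(join_lines(["ビールを", "ください。"]))
print(join_lines(["맥주", "주세요."]))
```

In this model a line break after Latin or Hangul text becomes an inter-word space, while a break after Kana or Kanji disappears, which matches scripts that do not separate words with spaces.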

Implementation: set3 & over BMP
(JIS2004 includes many characters from CJK Unified Ideographs Extension B.)
upTeX supports the SIP (Supplementary Ideographic Plane), U+2xxxx, by using the DVI command set3.
How visionary Knuth is!!
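How a character code maps to a DVI command can be sketched as follows. This is an illustrative Python sketch, not code from upTeX or any DVI driver: the function name dvi_set_bytes is hypothetical, the opcodes 128 to 130 for set1 to set3 are those of the DVI format, and the single-byte set_char commands that DVI also provides for codes 0 to 127 are ignored for simplicity.

```python
SET1, SET2, SET3 = 128, 129, 130  # DVI opcodes followed by a 1-, 2-, 3-byte code

def dvi_set_bytes(cp: int) -> bytes:
    """Encode 'typeset character cp' with the shortest of set1/set2/set3."""
    if cp < 0:
        raise ValueError("negative character code")
    if cp < 0x100:
        opcode, width = SET1, 1
    elif cp < 0x10000:
        opcode, width = SET2, 2       # BMP characters
    elif cp < 0x1000000:
        opcode, width = SET3, 3       # beyond the BMP, e.g. U+2xxxx
    else:
        raise ValueError("character code needs more than 24 bits")
    # DVI parameters are stored most significant byte first.
    return bytes([opcode]) + cp.to_bytes(width, "big")

print(dvi_set_bytes(0x3042).hex())   # a BMP character
print(dvi_set_bytes(0x20B9F).hex())  # a SIP character
```

A three-byte parameter covers every code up to 0xFFFFFF, so the whole Unicode range, including the SIP, fits without any change to the DVI format.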

Part IV: upTeX vs. Ω, XeTeX, ...

upTeX vs. Ω, XeTeX, ...
[Table: comparison of TeX, pTeX, upTeX and XeTeX. Rows: Compatibility (Latin, Japanese); Advancedness; Multilingual (Latin, Japanese, CK, others); Integrity (Japanese); Popularity (Japan, World)]

Part V: Present & Future

Present & Future: History

Year
1995  ASCII pTeX ver. 2, pLaTeX2e
2007  upTeX first release, alpha version
2007  upTeX is in W32TeX
2008  e-upTeX by Kitagawa-san
2012  upTeX 1.00
2012  upTeX is in TeX Live
2013  upTeX presentation at TUG 2013

Present & Future: Future
Currently, upTeX is capable of multilingual (CJK, Latin, Cyrillic, Greek) typesetting.
Possible items for the future are:
(1) Document classes for Chinese/Korean (Any volunteers?)
(2) Babel options for Chinese/Korean (They would be useful in ko.TeX etc. Any volunteers?)
(3) Does upTeX have the potential to be a useful CJK TeX?

Part VI: Appendix

Appendix: Latin/CJK tokens

                            TeX                 pTeX                       upTeX
Latin  I/O                  8bit (multibyte*)   7bit, 1 byte               8bit (multibyte*)
       token: charcode      8bit                8bit                       8bit
       token: catcode       4bit                4bit                       4bit
CJK    I/O                  -                   EUC etc., 8bit, 2 bytes    UTF-8, 8bit, 2-4 bytes
       token: charcode      -                   16bit                      24bit
       token: kcatcode      -                   -                          5bit
Latin/CJK classification    -                   fixed                      customizable
inputenc                    OK                  NG                         OK
Babel                       full                partial                    full

*: with inputenc
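The token sizes quoted in the slides (12bit Latin, 29bit CJK) follow from adding the widths in this table. The Python sketch below shows the arithmetic; the bit layout used by pack_token (category code in the high bits) is a hypothetical choice for illustration, since the slides do not state how upTeX arranges the fields internally.

```python
LATIN_CHAR_BITS, CATCODE_BITS = 8, 4     # 8 + 4  = 12-bit Latin token
CJK_CHAR_BITS, KCATCODE_BITS = 24, 5     # 24 + 5 = 29-bit CJK token

def pack_token(code: int, code_bits: int, charcode: int, char_bits: int) -> int:
    """Pack a (category code, character code) pair into one integer.

    Hypothetical layout: category code above the character code.
    """
    if not 0 <= code < (1 << code_bits):
        raise ValueError("category code out of range")
    if not 0 <= charcode < (1 << char_bits):
        raise ValueError("character code out of range")
    return (code << char_bits) | charcode

latin = pack_token(11, CATCODE_BITS, ord("a"), LATIN_CHAR_BITS)    # catcode 11: char
cjk = pack_token(16, KCATCODE_BITS, 0x20B9F, CJK_CHAR_BITS)        # kcatcode 16: Kanji

print(LATIN_CHAR_BITS + CATCODE_BITS, latin.bit_length())
print(CJK_CHAR_BITS + KCATCODE_BITS, cjk.bit_length())
```

A 24-bit character code is wide enough for every Unicode code point (the maximum, U+10FFFF, needs 21 bits), and the kcatcode values 15 to 19 do not fit in 4 bits, which is consistent with the 5-bit field.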

Appendix: Character encoding in upTeX

               Latin              CJK (upTeX extended)
               (TeX compatible)
               <256               BMP              over BMP            comment
.tex / .aux    UTF8               UTF8             UTF8
I/O buffer     1 byte             2-3 bytes        4 bytes
token          12bit              29bit            29bit               with (k)catcode
.dvi / .vf     set1               set2             set3
               T1 etc., 8bit      UCS2, 16bit      UTF32, 24bit
.tfm           T1 etc., 8bit      UCS2, 16bit      treated as Kanji    jfm for CJK
.ps / CMap     T1 etc., 8bit      UCS2, 16bit      UTF16, 2x16bit
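The byte counts in the "I/O buffer" row and the "2x16bit" entry for UTF16 can be reproduced with Python's built-in codecs. This is only a sketch of the encoding arithmetic; the helper name sizes is hypothetical and nothing here is taken from upTeX's source.

```python
def sizes(ch: str):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_bytes, utf16_units

for ch in ["a", "\u00E9", "\u3042", "\U00020B9F"]:
    # ASCII letter, e with acute, HIRAGANA LETTER A, an ideograph in the SIP
    print(f"U+{ord(ch):04X}", sizes(ch))
```

BMP characters outside ASCII take 2 or 3 bytes in UTF-8 and one UTF-16 unit, while characters beyond the BMP take 4 bytes and a surrogate pair. Note that a Latin character such as U+00E9 occupies 2 bytes in a UTF-8 file; the "1 byte" in the Latin column presumably reflects the TeX-compatible path, where input is handled one byte at a time (for example by inputenc).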

Appendix: kcatcode

kcatcode  catcode  kind         e.g.   control word  end of line
15        10       space        -      -             -
15        11       char         azaz   yes           as space
15        12       other char   (.!?   no            as space
16        -        Kanji               yes           ignore
17        -        Kana                yes           ignore
18        -        CJK symbol          no            ignore
19        -        Hangul              yes           as space