Dynamic Glyph Generation Based on variable length encoding

Similar documents
Universal Multiple-Octet Coded Character Set

Assignment 3 Functions, Graphics, and Decomposition

Worksheet - Storing Data

Chapter One Modifying Your Fonts

The FontFactory Window. Introduction. Installation. The File Menu. Open

VARIABLES. Aim Understanding how computer programs store values, and how they are accessed and used in computer programs.

ATypI Hongkong Development of a Pan-CJK Font

Introduction To Inkscape Creating Custom Graphics For Websites, Displays & Lessons

COPYRIGHTED MATERIAL. An Introduction to Computers That Will Actually Help You in Life. Chapter 1. Memory: Not Exactly 0s and 1s. Memory Organization

Adobe Illustrator CS Design Professional GETTING STARTED WITH ILLUSTRATOR

Reply to L2/10-327: Comments on L2/10-280, Proposal to Add Variation Sequences... 1

There we are; that's got the 3D screen and mouse sorted out.

User Interface Design. Interface Design 4. User Interface Design. User Interface Design. User Interface Design. User Interface Design

Direct Rendering. Direct Rendering Goals

Topics. Hardware and Software. Introduction. Main Memory. The CPU 9/21/2014. Introduction to Computers and Programming

The Unicode Standard Version 11.0 Core Specification

GUI-based Chinese Font Editing System Using Font Parameterization Technique

Notes on Turing s Theorem and Computability

Parallelism and Concurrency. COS 326 David Walker Princeton University

A Heuristic Search Approach to Chinese Glyph Generation Using Hierarchical Character Composition. 1. Introduction

COSC 243 (Computer Architecture)

Adobe Graphics Software

Scalable Vector Graphics (SVG) vector image World Wide Web Consortium (W3C) defined with XML searched indexed scripted compressed Mozilla Firefox

Adobe Photoshop Sh S.K. Sublania and Sh. Naresh Chand

Princeton University. Computer Science 217: Introduction to Programming Systems. Data Types in C

Image Types Vector vs. Raster

Can R Speak Your Language?

Vision. OCR and OCV Application Guide OCR and OCV Application Guide 1/14

Creating Digital Illustrations for Your Research Workshop III Basic Illustration Demo

The MathType Window. The picture below shows MathType with all parts of its toolbar visible: Small bar. Tabs. Ruler. Selection.

Designing & Developing Pan-CJK Fonts for Today

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Proposal to Add Four SENĆOŦEN Latin Charaters

DRAWING VECTOR VS PIXEL SHAPES IN PHOTOSHOP CS6

1. New document, set to 5in x 5in, no bleed. Color Mode should be default at CMYK. If it s not, changed that when the new document opens.

Keyboards for inputting Chinese Language: A study based on US Patents

LE840/LE850. Printer Setting Tool Manual Technical Reference

Script for Interview about LATEX and Friends

John W. Jacobs Technology Center 450 Exton Square Parkway Exton, PA Introduction to

Lou Burnard Consulting

Chapter 10 Working with Graphs and Charts

Layman s comments on the encoding proposal Khitan small script

Printing Envelopes in Microsoft Word

Game keystrokes or Calculates how fast and moves a cartoon Joystick movements how far to move a cartoon figure on screen figure on screen

6 Using Adobe illustrator: advanced

Proposal to Encode the Ganda Currency Mark for Bengali in the BMP of the UCS

Micriµm, Inc. Using a GUI in an Embedded System

Performance Analysis and Culling Algorithms

Princeton University Computer Science 217: Introduction to Programming Systems The C Programming Language Part 1

FOUNDATION IN. GRAPHIC DESIGN with ADOBE APPLICATIONS. LESSON 1 Summary Notes

Nano-Lisp The Tutorial Handbook

Week - 04 Lecture - 01 Merge Sort. (Refer Slide Time: 00:02)

Three OPTIMIZING. Your System for Photoshop. Tuning for Performance

Word: Print Address Labels Using Mail Merge

Let s Make a Front Panel using FrontCAD

The Unicode Standard Version 10.0 Core Specification

Proposed Update. Unicode Standard Annex #11

Form number: N2352-F (Original ; Revised , , , , , , ) N2352-F Page 1 of 7

Repetition Through Recursion

Easy-to-see Distinguishable and recognizable with legibility. User-friendly Eye friendly with beauty and grace.

Photoshop Introduction to The Shape Tool nigelbuckner This handout is an introduction to get you started using the Shape tool.

Table Basics. The structure of an table

Ray Casting of Trimmed NURBS Surfaces on the GPU

Adobe. SING Technology. Solving the missing character problem. Thomas Phinney Program Manager Fonts & SING Technologies 28 September 2006

Data Representation From 0s and 1s to images CPSC 101

Building Source Han Sans & Noto Sans CJK

OpenType Font by Harsha Wijayawardhana UCSC

Mako is a multi-platform technology for creating,

Introducing Adobe Photoshop CC

SAPGUI for Windows - I18N User s Guide

such a manner that we are able to understand, grasp and grapple with the problem at hand in a more organized fashion.

Corel Draw 11. What is Vector Graphics?

Chapter 1. Getting to Know Illustrator

Documentation English v1

C H A P T E R 1. Introduction to Computers and Programming

Introduction to Computer Graphics

OPTIMIZING PDFS WITH ACROBAT PRO 8

LESSON 1. A C program is constructed as a sequence of characters. Among the characters that can be used in a program are:

Variables and Data Representation

PDF417 Encoder. Programmer s Manual

How to...create a Video VBOX Gauge in Inkscape. So you want to create your own gauge? How about a transparent background for those text elements?

CS 563 Advanced Topics in Computer Graphics QSplat. by Matt Maziarz

Diploma in GRAPHIC DESIGN. - Part 1 LESSON 1

Ai Adobe. Illustrator. Creative Cloud Beginner

ArcGIS Runtime: Maximizing Performance of Your Apps. Will Jarvis and Ralf Gottschalk

2/12/17. Goals of this Lecture. Historical context Princeton University Computer Science 217: Introduction to Programming Systems

PDF PDF PDF PDF PDF internals PDF PDF

(Refer Slide Time: 00:01:30)

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

USE CASE. Collect CLOSED CASE FEEDBACK. Salesforce Workflow. Clicktools Deployment TWO DEPLOYMENT APPROACHES. The basic activity flow goes like this:

Output in Window Systems and Toolkits

Binary Trees Case-studies

CREATING AN ILLUSTRATION WITH THE DRAWING TOOLS

1. Introduction. 2. Proposed Characters Block: Latin Extended-E Historic letters for Sakha (Yakut)

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:

State of CJK issues of LibreOffice, m TIRANA 27 Sept.

L2/ Proposal to encode archaic vowel signs O OO for Kannada. 1. Thanks. 2. Introduction

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

CS02b Project 2 String compression with Huffman trees

Binary Representation. Jerry Cain CS 106AJ October 29, 2018 slides courtesy of Eric Roberts

Transcription:

Kyoto University 21st Century COE Program Dynamic Glyph Generation Based on variable length encoding schema Yap Cheah Shen 1) Abstract About 20 years ago, Prof. Hsieh Ching-Chun from Academia Sinica proposed a descriptive method for encoding Hanzi (Kanji), as a fundamental guideline to deal with the missing Chinese character problem. As we ( eforth Techology, Inc ) are trying to develop an OS for our embedded CPU, we have a chance to design CJK environment from scratch. We soon found out the key part of the system is a module which takes glyph expression as input, and generates glyph outline as output. It is called dynamic glyph generation. The term dynamic emphasizes glyph outline is generated on the fly, in contrast to ordinary static font file approach. 1 Morpheme: Latin vs. Han A Morpheme is the smallest meaningful unit in the grammar of a language. In Indo-European languages, it is word separated by space 1. In Chinese, there are no spaces, the morpheme is Hanzi. Morpheme is the basic unit which map to the real world ideas. No doubt the set of morphemes will keep changing and expanding from time to time, from location to location. Changing means a word maps to different 2 ideas, and expanding means new 3 word is created. 2 Latin text encoding Numerous types of protein molecule are constructed from only 20 amino acids. Likewise, all English words can be represented by sequence of alphabets. The natural encoding unit for Indo-European languages is alphabet.this is how ASCII works, encoding alphabets and symbols, not words. To encode words as sequences of alphabets is neither space nor time efficient. Fixed-bit or Huffman encoding is more compact and requires less computing effort. We choose the less efficient way because of extensibility, to ensure the encoding system remain unchanged, no matter how many new words are introduced. 1) eforth Techology. Inc. 1 To be more precise, unladylike is a word, but consisting of 3 morpheme: un-, lady and like. We don t have to worry about the linguistic definition in our discussions. 2 For example, current maps to ideas of con-temporary and flow of water. Later, it refers to flow of electricity. 3 Brands, Abbreviations, etc. 3 Missing Characters in Chinese Text Hanzi are in an open-set, theoretically, historically and practically. Therefore, Hanzi can never be totally encoded with fix-byte, existing encoding schema, including Unicode, which does not respect the nature of Chinese language and made a wrong approach. The consequence is bound to be miserable. Unending loop of assigning code point to characters, waiting for OS vendor to release patches or updates, waiting for new input method table, buying new fonts, etc. The users suffer in every stage. But, vendors are happy by generating profit out of the process. Surely this is not an ideal way of dealing with CJK texts. Because of the wrong design, users and the society keep suffering and paying high cost. How can we come out of this misery? 4 Solution Unlike alphabet, the basic unit of Hanzi is more subtle. Most Hanzi 4 are form by two or more parts or components. The number of component is quite limited, and some have really high use frequency. Prof. Hsieh has done a lot of work 5 on this. Although most components are Hanzi themselves, but it is not necessary so. Components show irregular distribution. A small number of components occur quite frequently, but many components 4 More than 85% our of 70000 Hanzi are composite. 5 Please visit http://www.sinica.edu.tw/ cdp/paper/ 1996/19961005 1.htm

only happen once or twice. Chances are rare that new components are needed for new Hanzi. To simplify the system, we pick about 800 components as a basic set. Basic strokes are used for describing components which are not included. We come out with a closed set of basic components and strokes as encoding units. Because Hanzi is two dimensional, we need three operators for describing how component are joined: horizontal, vertical, and enclosing. We add a new shielding operator, which hide strokes. It is useful for describing tabooed 6 Hanzi. Prefix notation is used, as suggested by Unicode IDS specification. A CJK environment can be divided into three aspect: inputting, displaying/printing, and storing/interchanging. In ordinary CJK environment, a fix numeric value is given to each Hanzi, which is call the character code. Character code is difficult for human being to memorized. The most logical method is to split Hanzi into phoneme or components, and map each phoneme and component to a key. Hanzi can be represented as sequences of keystrokes. Different input method has different way of mapping. But, the principle is still the same. Inputting of Hanzi is a table-checking problem. That s why even there are hundreds of input methods, none of them can input a Hanzi which has no character code. The glyph of a Hanzi must be pre-designed and store in a font file, either in outline or raster format. The glyph for a certain Hanzi is accessed by a character code. Again, if a character has no character code, no matter how many font types are installed to your computer, you can never get the glyph. Someone may argue that we can assign missing character in EUDC (end-user defined character) area and use font designing tool to create glyph for the Hanzi. This is not a good solution, because other people don t acknowledge your assignment to the Hanzi, and worse if he has different Hanzi assign to the same character code. Another problem is the EUDC area is limited, and it is never enough to hold all Hanzi and their variants. In ordinary system, a text file is nothing but se- 6 In Chinese tradition, to show respect, name of emperor, saint and father should be written differently, In most cases a stroke is taken out. Which are useful clue for paleology and bibliology. quences of character code. If a Hanzi doesn t have character code, it cannot be represented in a text file. Existing input method and text format are compatible with variable length glyph expression. It is very easy to encode three operators and basic component in existing input method. (In fact, most component-based input methods share the same set of common components.) Texts remain unchanged, because ideographic descriptive sequence (IDS) is no different from normal text stream. We have only one problem left: how to get a glyph which are not in font file? The combination of components is endless, not only all Hanzi can be represented by IDS, but we re free to create unlimited number of new Hanzi, which have not be seen before. Obviously the static font file approach cannot meet this purpose. We need a mechanism to generate glyph from an IDS. This is the idea of Dynamic Glyph Generator. 5 Dynamic Glyph Generator Dynamic Glyph Generator takes IDS or other descriptive sequence as input, namely glyph composing expression proposed by Academia Sinica, and glyph assembling expression proposed by CBETA 7. Dynamic Glyph Generator first decomposes the glyph. The result is the sequence of basic component and their proportion. Basic component is constructed by basic strokes. Strokes are stored as bezier curves. The generator then recursively builds up the full glyph with these information. The output can be raster bitmap of any pixel format, true-type compatible outline, SVG or any other vector format. The generator is very similar to a typesetting program, trying to fit one dimensional stroke sequences into a 2 dimensional square. 6 Implementation The system consists of 3 major parts: Glyph decomposition database: courtesy of Prof. Hsieh from Academia Sinica, Taiwan http://www. 7 Chinese Buddhist Electronic Text Association, http:// www.cbeta.org 94

sinica.edu.tw/ cdp/. Prof. Hsieh and his associates 8 have been working on it for more than 10 years, it is almost impossible for us to start without their work. Outline of strokes and components: Beijing ZhongYi Co. http://www.zhongyicts. com.cn/ is a professional outline font vendor, they provide us very high quality outline data for strokes and components. The eforth system: eforth system is a standalone computing environment. We have full control over the CPU, the OS and the GUI, so that new idea can be implement without any compromise. Our goal is putting everything into a single chip. CJK environment is built into hardware. Users get everything when power is on. No installation and configuration is needed, and no more software incompability issue. 8 Other issues Quality of the glyph: The quality of the generated glyph can never be as good as pre-designed, fine tuned outline, because the system take only 1/20 1/100 space compare to static font approach. Should better quality is needed, common Hanzi 9 can be stored as outline, thus offering better quality and increasing overall performance. Speed of generation: Glyph generation is never as fast as retrieving glyph from font file, but this is no problem for ordinary system, because glyph generation is rare. For handheld device, we have special hardware acceleration in our CPU, e.g, line drawing and polygon fill, which boost the performance dramatically. 7 Integrating into existing OS/GUI Dynamic Glyph Generator is originally design to fit the requirement of a embedded system. It is possible to integrate Dynamic Glyph Generator into existing systems. Here are things to be done: 1. String manipulation library Function to calculate number of characters Function which return characters width. Pattern matching functions. 2. Graphic sub-system Drawing a text line (e.g. ExtTextOut in windows) Text handling widgets 3. Human interface Awareness of glyph expression for mouse caret Awareness of mouse/keyboard selection and delete/backspace. 8 I have to acknowledge Mr.Zhuang Derming, who maintains the database. 9 Study shows that 1000 Hanzi cover 90% of normal text. 95

96

97