Representing Characters and Text

Similar documents
UTF and Turkish. İstinye University. Representing Text

Representing Characters, Strings and Text

2011 Martin v. Löwis. Data-centric XML. Character Sets

2007 Martin v. Löwis. Data-centric XML. Character Sets

Character Encodings. Fabian M. Suchanek

Digital Logic. The Binary System is a way of writing numbers using only the digits 0 and 1. This is the method used by the (digital) computer.

IT101. Characters: from ASCII to Unicode

Chapter 4: Computer Codes. In this chapter you will learn about:

Unicode and Non Unicode Printing with the Swiss 721 Font

Using non-latin alphabets in Blaise

2a. Codes and number systems (continued) How to get the binary representation of an integer: special case of application of the inverse Horner scheme

Chapter 11 : Computer Science. Information Representation. Class XI ( As per CBSE Board) New Syllabus

Can R Speak Your Language?

You 2 Software

CS144: Content Encoding

Chapter 7. Representing Information Digitally

CMPS 10 Introduction to Computer Science Lecture Notes

Programming Concepts and Skills. ASCII and other codes

Administrative Notes February 9, 2017

Coding Theory. Networks and Embedded Software. Digital Circuits. by Wolfgang Neff

Routine Routine/ Minor/ Moderate/ Serious / Major/ Critical

COM Text User Manual

1.1. INTRODUCTION 1.2. NUMBER SYSTEMS

Computer Science Applications to Cultural Heritage. Introduction to computer systems

SAPGUI for Windows - I18N User s Guide

LBSC 690: Information Technology Lecture 05 Structured data and databases

UNIT 7A Data Representation: Numbers and Text. Digital Data

COSC 243 (Computer Architecture)

Tex with Unicode Characters

The Unicode Standard Version 11.0 Core Specification

Computer Science 1001.py. Lecture 19, part B: Characters and Text Representation: Ascii and Unicode

LING 388: Computers and Language. Lecture 5

Computer Science 1001.py. Lecture 19a: Generators continued; Characters and Text Representation: Ascii and Unicode

Google Search Appliance

Review of HTML. Ch. 1

Chapter 3. Information Representation

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Intermediate Programming & Design (C++) Notation

Number codes nibble byte word double word

Unicode and its discontents. Jeremy G. Kahn Machine Translation reading group 5 May 2008

Princeton University. Computer Science 217: Introduction to Programming Systems. Data Types in C

Practical character sets

Representing Information. Bit Juggling. - Representing information using bits - Number representations. - Some other bits - Chapters 1 and 2.3,2.

Picsel epage. PowerPoint file format support

Wordman s Production Corner

Recent developments in IT Standards of interest to translators

CP-147 Date 1999/01/30. Name of Standard: PS 3.3,

by Martin J. Dürst, University of Zurich (1997) Presented by Marvin Humphrey for Papers We Love San Diego November 1, 2018

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 02, SPRING 2013

PrecisionID ITF Barcode Fonts User Manual

IT 1204 Section 2.0. Data Representation and Arithmetic. 2009, University of Colombo School of Computing 1

Note 8. Internationalization

BrianHetrick.com Application Note AN-1.0 A Latin-1 and Latin-3 Characters US Keyboard Layout for Microsoft Windows

Representing text on the computer: ASCII, Unicode, and UTF 8

A tutorial on character code issues

Chapter 9: Data Transmission

umber Systems bit nibble byte word binary decimal

Computer Organization and Assembly Language. Lab Session 4

text2reach2 SMS API Sep 5, 2013 v1.1 This document describes application interface (API) between SMS service provider (SP) and SMS gateway (SMSGW).

Memory Addressing, Binary, and Hexadecimal Review

Announcement. (CSC-3501) Lecture 3 (22 Jan 2008) Today, 1 st homework will be uploaded at our class website. Seung-Jong Park (Jay)

Friendly Fonts for your Design

Chapter 3 DATA REPRESENTATION

1.1 Information representation

CS-201 Introduction to Programming with Java

ATypI Hongkong Development of a Pan-CJK Font

ICANN IDN TLD Variant Issues Project. Presentation to the Unicode Technical Committee Andrew Sullivan (consultant)

Using sticks to count was a great idea for its time. And using symbols instead of real sticks was much better.

Japanese utf 8 font. Japanese utf 8 font.zip

Numbers and Representations

Picsel epage. Word file format support

Source coding and compression

Version 5.5. Multi-language Projects. Citect Pty Ltd 3 Fitzsimmons Lane Gordon NSW 2072 Australia

Extended Character Sets for UCAS Systems

Coordination! As complex as Format Integration!

1 Lithuanian Lettering

ESCAPE SEQUENCE G0: ESC 02/08 04/13 C0: C1: NAME Extended African Latin alphabet coded character set for bibliographic information interchange

This proposal is limited to the addition and rearrangement of some of the Korean character part of ISO/IEC (UCS2).

CS 137 Part 6. ASCII, Characters, Strings and Unicode. November 3rd, 2017

Information technology Keyboard layouts for text and office systems. Part 9: Multi-lingual, multiscript keyboard layouts

Redrabbit Cloud-based Communications Platform SMS APIs

Multimedia Data. Multimedia Data. Text Vector Graphics 3-D Vector Graphics. Raster Graphics Digital Image Voxel. Audio Digital Video

Terry Carl Walker 1739 East Palm Lane Phoenix, AZ United States

Using Unicode with MIME

Characters, Strings and Text Processing in Java

LING 408/508: Computational Techniques for Linguists. Lecture 3

EMu Documentation. Unicode in EMu 5.0. Document Version 1. EMu 5.0

Problems with FrameMaker 7 on MS Windows and non-western languages

OBJECTIVES After reading this chapter, the student should be able to:

UNIT 2 NUMBER SYSTEM AND PROGRAMMING LANGUAGES

Digital Imaging and Communications in Medicine (DICOM) Supplement 9 Multi-byte Character Set Support

DATA REPRESENTATION. By- Neha Tyagi PGT CS KV 5 Jaipur II Shift, Jaipur Region. Based on CBSE curriculum Class 11. Neha Tyagi, KV 5 Jaipur II Shift

ECE260: Fundamentals of Computer Engineering

Navigating the pitfalls of cross platform copies

Numeral Systems (Part II)

[301] Bits and Memory. Tyler Caraza-Harter

Blaise Team IBUC, April 24, 2012

Computers, Numbers, and Text

Stream Ciphers. Çetin Kaya Koç Winter / 13

CPSC 301: Computing in the Life Sciences Lecture Notes 16: Data Representation

Transcription:

Representing Characters and Text cs4: Computer Science Bootcamp Çetin Kaya Koç cetinkoc@ucsb.edu Çetin Kaya Koç http://koclab.org Winter 2018 1 / 28

Representing Text Representation of text predates the use of computers for text Text representation was needed for communication equipment One particular commonly used communication equipment was teleprinter (also called as teletypewriter, teletype or tty) Çetin Kaya Koç http://koclab.org Winter 2018 2 / 28

Representing Text Teleprinter was an electromechanical typewriter that can be used to send and receive typed messages over various types of communications channels Teleprinters were adapted to provide a user interface to early mainframe computers and minicomputers They were used for sending typed data to the computer and printing the response from the computer Çetin Kaya Koç http://koclab.org Winter 2018 3 / 28

Representing Text Computers can only understand numbers Çetin Kaya Koç http://koclab.org Winter 2018 4 / 28

Teleprinters and ASCII Thus, we need a number representation of characters: a or @ A number representation for the English alphabet was developed in 1967 by the American Standards Association (ASA) in the US, which remains in use today This representation is called ASCII which stands for American Standard Code for Information Interchange Çetin Kaya Koç http://koclab.org Winter 2018 5 / 28

ASCII ASCII was originally based on the English alphabet ASCII encodes 128 characters into 7-bit binary integers 33 of these 128 are non-printing control characters Çetin Kaya Koç http://koclab.org Winter 2018 6 / 28

American Standard Code for Information Interchange a is encoded as (110 0001) 2 = (61) 16 = (97) 10 @ is encoded as (100 0000) 2 = (40) 16 = (64) 10 Çetin Kaya Koç http://koclab.org Winter 2018 7 / 28

American Standard Code for Information Interchange The first 32 (0 to 31) are control codes For example: Code 10 (0A) represents the line feed function (which causes the printer to advance its paper) Code 8 (08) represents backspace Code 127 (FF) represents delete Code 32 (20) represents (Space) Çetin Kaya Koç http://koclab.org Winter 2018 8 / 28

American Standard Code for Information Interchange The codes 33-47, 58-64, and 91-96 are printable characters The codes 48-57 are numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 The codes 65-90 are upper case letters: A, B, C, D, E, F, G, H, I,... The codes 97-122 are lower case letters: a, b, c, d, e, f, g, h, i,... Çetin Kaya Koç http://koclab.org Winter 2018 9 / 28

American Standard Code for Information Interchange Çetin Kaya Koç http://koclab.org Winter 2018 10 / 28

American Standard Code for Information Interchange Normally a 7-bit code is placed inside an 8-bit (1-byte) binary number, keeping the most significant (leftmost) bit 0 Therefore, a simple text file containing the word hello will have 5 bytes for the 5 letters There may also be 1 byte space for the LF (line feed) character at the end of each line Demonstrate file5.txt and file6.txt Çetin Kaya Koç http://koclab.org Winter 2018 11 / 28

Extensions of ASCII Soon after ASCII was introduced, Europeans became very sad Çetin Kaya Koç http://koclab.org Winter 2018 12 / 28

Extensions of ASCII Soon after ASCII was introduced, Europeans became very sad They wanted codes for their alphabet characters Çetin Kaya Koç http://koclab.org Winter 2018 12 / 28

Extensions of ASCII They needed code for representation of characters found in Western European languages, such as ç ü é â ä å... Çetin Kaya Koç http://koclab.org Winter 2018 13 / 28

Extensions of ASCII Since we have 1 byte (8 bits), we have room for 256 characters This gives us 8-bit ASCII, codes from 00 (hex) to FF (hex) This created Extended ASCII Code 128 Ç Code 129 ü Code 130 é Code 135 ç Code 148 ö Code 153 Ö Code 154 Ö Extended ASCII does not have codes for ǧ Ǧ ı İ ş Ş Çetin Kaya Koç http://koclab.org Winter 2018 14 / 28

Extensions of ASCII Unfortunately, everyone had their idea of what is needed, and many standards bodies developed different variants of ASCII Different operating systems, game consoles, and computers have designed their own ASCII extensions However all of these variants include 7-bit ASCII (sometimes called US-ASCII) as their subset The International Organization for Standardization (ISO) standardized 8-bit codes and named is ISO 8859 in 1998 These form different sets of 8-bit codes Çetin Kaya Koç http://koclab.org Winter 2018 15 / 28

ISO 8859 The standard ISO 8859 covers an almost complete list of Western European languages It supports an extended list of languages Latin-1 (Western European languages) Latin-2 (Non-Cyrillic Central and Eastern European languages) Latin-3 (Southern European languages and Esperanto) Latin-5 (Turkish) Latin-6 (Northern European and Baltic languages) 8859-5 (Cyrillic) 8859-6 (Arabic) 8859-7 (Greek) 8859-8 (Hebrew) Not all software can parse ISO-8859 character sets Çetin Kaya Koç http://koclab.org Winter 2018 16 / 28

Further Extension of ASCII Moreover, there are many more languages and scripts in the world Consider: Arabic, Armenian, Hebrew, Chinese, Japanese, Korean We need a much more comprehensive encoding system Çetin Kaya Koç http://koclab.org Winter 2018 17 / 28

Unicode Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world s writing systems Unicode Goal: One encoding for all scripts of the world! Unicode contains a repertoire of more than 110,000 characters covering 123 scripts and multiple symbol sets Unicode covers almost all scripts in current use today Unicode defines 1,114,112 code points, in the range 0 to 10FFFF Çetin Kaya Koç http://koclab.org Winter 2018 18 / 28

Unicode Unicode can be implemented by different character encodings The most commonly used encoding is UTF-8 UTF stands for Unicode Transformation Format UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters Of more than a million code points, about 100,000 are assigned Most assignments are in the first 65,536 code points Çetin Kaya Koç http://koclab.org Winter 2018 19 / 28

Unicode Properties Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639 This number is called a code point The U+ means Unicode and the numbers are hexadecimal U+0639 is the Arabic letter Ain The English letter A would be U+0041. There is no real limit on the number of letters that Unicode can define and they go beyond 65,536 Çetin Kaya Koç http://koclab.org Winter 2018 20 / 28

Unicode Examples English text looks exactly the same in UTF-8 as it does in ASCII Specifically, Hello will be stored as 48 65 6C 6C 6F, which is the same as it is stored in ASCII The string Hello in Unicode corresponds to five code points: U+0048 U+0065 U+006C U+006C U+006F The encoding scheme determines how these bytes are to be stored As we said, the most commonly used encoding is UTF-8 On the other hand, Arabic, Armenian or other letters will be represented according to their (mostly 2-byte) Unicode definitions, found in http://www.unicode.org/charts Çetin Kaya Koç http://koclab.org Winter 2018 21 / 28

Unicode 7-bit ASCII in Python In Python, all strings are stored in UTF-8 Unicode 7-bit ASCII (US-ASCII) Example: Çetin Kaya Koç http://koclab.org Winter 2018 22 / 28

Unicode 8-bit ASCII in Python Çetin Kaya Koç http://koclab.org Winter 2018 23 / 28

Unicode Arabic Çetin Kaya Koç http://koclab.org Winter 2018 24 / 28

Unicode Armenian Çetin Kaya Koç http://koclab.org Winter 2018 25 / 28

Unicode CJK - Chinese Japanese Korean Çetin Kaya Koç http://koclab.org Winter 2018 26 / 28

Unicode CJK - Chinese Japanese Korean Çetin Kaya Koç http://koclab.org Winter 2018 27 / 28

Unicode CJK - Chinese Japanese Korean Çetin Kaya Koç http://koclab.org Winter 2018 28 / 28