Representing text on the computer: ASCII, Unicode, and UTF-8


STAT/CS 287, Jim Bagrow

Question: computers can only understand numbers. In particular, only two numbers, 0 and 1 (binary digits, or bits). So how do we get computers to read and write text?

Originally, computers (like in the 1950s) didn't bother with this problem; they just expected their operators to read binary data. But eventually this had to be solved. The solution: determine a way to map each letter/symbol/character to a string of bits. Character = smallest possible component of a text.

With n bits you have 2^n unique combinations, so we can represent 2^n different characters. n = 7 gives 128 symbols. This is enough to store the basics of the Roman alphabet (26 lowercase letters + 26 uppercase letters + 10 digits + punctuation, plus computer control codes like end of file, start of file, etc.). There are 95 printable characters and 33 non-printing control codes.

However, the mapping from bits to symbols is arbitrary; it's just a table. To exchange data between systems, they need to understand each other's mapping. A standard was needed: ASCII.

The ASCII lookup table maps: character -> ASCII code (number) -> binary representation. In some sense the ASCII code is the character, in that we are not concerned with the glyph of the character, its visual representation. An operating system or software toolkit for graphical interfaces usually deals with the problem of taking a character and representing it visually.

    Letter  ASCII Code  Binary     |  Letter  ASCII Code  Binary
    a       097         01100001   |  A       065         01000001
    b       098         01100010   |  B       066         01000010
    c       099         01100011   |  C       067         01000011
    ...

The table is arbitrary, but the designers of ASCII settled on some clever ideas. For example, the upper and lower case versions of a letter are always separated by 32, so flipping a single bit will change from upper- to lowercase. The process of mapping from the actual character to the binary representation (a -> 01100001) is called encoding. Mapping back from the binary to the character is called decoding.
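These mappings are easy to poke at in Python. A minimal sketch, using only the built-ins ord, chr, str.encode, and bytes.decode:

    # Encoding: character -> number -> bits
    print(ord('a'))                  # 97, the ASCII code for 'a'
    print(format(ord('a'), '08b'))   # 01100001, its 8-bit binary form
    print('abc'.encode('ascii'))     # b'abc', the encoded byte string

    # Decoding: number/bits -> character
    print(chr(97))                   # 'a'
    print(b'abc'.decode('ascii'))    # 'abc'

    # The clever case trick: upper and lower case differ by exactly 32,
    # a single bit (the 32s place), so XOR-ing that bit flips the case.
    print(chr(ord('a') ^ 32))        # 'A'
    print(chr(ord('A') ^ 32))        # 'a'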

A text file is just a long sequence of 8-bit blocks. A program reads them 8 bits at a time and then determines the corresponding letters. Notice that each binary representation is 8 bits, or one byte: 2^8 = 256 possible numbers. But ASCII defines only 128 characters/symbols (0-127), and the first bit of every byte is zero. Why?

ASCII predates many things, including the internet, including the definition of bytes as 8 bits. Many people argued at the time for 7-bit words, and computers were optimized to process data in 7-bit chunks. This is why it's 128 symbols and 7 bits. Eventually, people settled on 8 bits to a byte, optimized their hardware accordingly, and decided to make ASCII backwards compatible by sticking a zero on the front. Problem solved.

ASCII's 128 symbols don't leave much room for nice-looking text like left-vs-right quotes, long and short dashes, and language-specific symbols like diacriticals (umlauts). North Americans didn't care too much, but Europeans, Russians, etc. did. So they started using the extra 128 symbols made available by putting a 1 at the beginning of each byte. But every country developed its own encoding for what characters 128-255 mean, and reading a text file with the wrong lookup table produces garbage. Eventually a new standard for these characters, called Latin-1, somewhat standardized the next 128 symbols, but this doesn't solve the real problem: 256 symbols is nowhere near enough space to store all the world's languages. What happens when, say, an international network spanning the entire world shows up, and we want to communicate with everyone?

Enter Unicode

A non-profit, the Unicode Consortium, was established in 1991 to make a new lookup table for EVERYTHING. Miraculously, over 25 or so years, they have built a standard table for characters. The Unicode table consists of 17 planes, each of which can represent 65,536 (2^16) characters, giving a total of 17 × 65,536 = 1,114,112 possible characters. Most of Unicode is empty: 136,690 characters have been assigned as of version 10.0 (June 2017), or 12.27%. There is likely enough space to represent any character in any language in existence (including extinct languages, fictional languages, emoji).

Like ASCII, there's some cleverness in how the table is organized. In fact, the original ASCII numbers are the same in Unicode. Backwards compatible! The 1.1 million slots in the Unicode table are called code points. We can think of them as just numbers. But what Unicode does not do is character encoding. How do we map those numbers to, e.g., bits and bytes?
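In Python 3, strings are sequences of code points, so we can inspect the table directly. A quick sketch using ord, chr, and the standard-library unicodedata module:

    import unicodedata

    # ASCII code points carry over into Unicode unchanged.
    print(ord('A'))               # 65, same as in ASCII

    # Code points beyond ASCII are just bigger numbers.
    print(ord('é'))               # 233 (U+00E9)
    print(ord('😀'))              # 128512 (U+1F600, off the first plane)
    print(chr(0x1F600))           # '😀'

    # Every assigned code point has an official name in the table.
    print(unicodedata.name('é'))  # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.name('😀')) # GRINNING FACE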

Encoding Unicode

There was a long fight over how best to encode Unicode. In fact, much of the Unicode design (its planes) is based around an intended encoding called UTF-16 (UTF = Unicode Transformation Format).

The obvious way to encode Unicode is just more bits: n = 24 gives 2^24 ≈ 16.7 million values, plenty of room. Problems:

1. Since Unicode is backwards compatible with ASCII, the letter A is now going to be a whole mess of zeros followed by a few 1s and 0s all the way at the right. And this happens for every single character. Wasteful: it automatically makes every English-language plain text file 3 times bigger (for n = 24). We have to get rid of all the zeros in English text.
2. Historically, 8 zeros in a row (the null byte) is a special computer control code representing the end of a string of characters. So if you send a file with all those leading zeros, many computers will assume the text terminates before it does! We cannot send 8 zeros in a row anywhere within a valid character.
3. Backwards compatibility. If you are reading plaintext English in this new encoding with code that only understands ASCII, ideally it will still be read correctly.

UTF-8 solves all of this!!!

UTF-8

A brilliant idea by Ken Thompson (one of the creators of UNIX). UTF-8 is a multibyte encoding: sometimes you use only 8 bits to store a character, other times 16 or 24 or more. The challenge with a multibyte encoding is: how do you know, for example, that these 16 bits represent a single two-byte character and not two one-byte characters? UTF-8 solves this character boundary problem!

First, if you have a Unicode code point under 128 (which is ASCII), you put down a zero and then the seven bits of ASCII. All ASCII is automatically UTF-8!

Now, what if we have a code point of 128 or greater? First, use 110 to indicate the start of a new character of two bytes (the two ones tell us it's two bytes). Then the second byte begins with 10 to indicate continuation. This leaves 5 + 6 blank spaces in the remaining bits of these two bytes:

    110xxxxx | 10xxxxxx

(I'm using " | " to represent the byte boundaries and x as a placeholder for any value of a bit.) These blank spaces then encode the character: 11 bits, or 2^11 = 2048 characters.

What about more than that? Start with a header of 1110 instead of 110. This tells us there are three bytes in the character (b/c there are three ones). The next two bytes again begin with 10 to indicate continuation.
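We can watch these bit patterns appear by encoding characters and printing each byte in binary, and the header/continuation scheme makes finding character boundaries a one-line loop. A small sketch (utf8_bits and prev_boundary are illustrative helper names, not standard functions):

    def utf8_bits(ch):
        """Print each UTF-8 byte of ch in binary."""
        print(ch, '->', ' | '.join(format(b, '08b') for b in ch.encode('utf-8')))

    utf8_bits('A')  # A -> 01000001                        (1 byte: plain ASCII)
    utf8_bits('é')  # é -> 11000011 | 10101001             (2 bytes: 110..., 10...)
    utf8_bits('€')  # € -> 11100010 | 10000010 | 10101100  (3 bytes: 1110..., 10..., 10...)

    def prev_boundary(data, i):
        """Scan backwards from byte index i to the start of the current
        character: continuation bytes always look like 10xxxxxx."""
        while data[i] & 0b11000000 == 0b10000000:
            i -= 1
        return i

    data = 'aé€'.encode('utf-8')   # 1 + 2 + 3 = 6 bytes
    print(prev_boundary(data, 5))  # 3: byte 5 is mid-€, whose bytes start at index 3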

    1110xxxx | 10xxxxxx | 10xxxxxx

That gives 4 + 6 + 6 = 16 bits to represent characters. Want more? The original design goes all the way to a 1111110x header, or six bytes, which gives 1 + 5×6 = 31 payload bits, over 2 billion possible characters. (The modern standard caps UTF-8 at four bytes, which is enough to reach all 1,114,112 Unicode code points.)

UTF-8 is backwards compatible with ASCII. It avoids waste. It never sends 8 zeros in a row. Most importantly, and why UTF-8 is (finally) the winning encoding: you can read through the string of bits very easily. You don't need an index tracking which character's bytes you are on. If you are halfway through the string and you want to go back to the previous character, you just scan backwards until you see the previous header (as in the prev_boundary sketch above).

In 2008, UTF-8 became the most popular encoding in the world, surpassing pure ASCII, according to Google. As of November 2017, UTF-8 accounts for 90.1% of web pages.

Python and Unicode

https://docs.python.org/3/howto/unicode.html

With this history lesson we've actually addressed one of the challenges of working with Unicode text: the error messages.

UnicodeDecodeError - something went wrong reading binary data into the code points Python works with
UnicodeEncodeError - something went wrong writing code points out as binary data

UnicodeEncodeErrors often happen when you intend to write Unicode but you've accidentally told Python to only allow plain ASCII. When it encounters a non-ASCII character, you get an error:

    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)

position 0 - the first character is bad
ordinal not in range(128) - our code point is not in [0, 127]
ascii codec - an encoding standard is called a codec, and we chose ASCII
\ua000 - a four-digit hexadecimal encoding of the Unicode code point. Sometimes these are written as U+xxxx. (Hexadecimal is base 16; that's why there are letters in the digits.)
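You can reproduce that exact error, and its decoding mirror image, in a couple of lines (a quick sketch):

    s = '\ua000'              # U+A000, YI SYLLABLE IT -- not an ASCII character
    print(s.encode('utf-8'))  # b'\xea\x80\x80' -- UTF-8 handles it fine
    s.encode('ascii')         # raises UnicodeEncodeError: 'ascii' codec can't
                              # encode character '\ua000' in position 0:
                              # ordinal not in range(128)

    b'\xff'.decode('utf-8')   # raises UnicodeDecodeError, since 11111111
                              # is not a valid UTF-8 header byte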

Reading and writing: the built-in open() function in Python supports more arguments than just the filename and the mode ('r', 'w', etc.):

    f = open(fname, 'r', encoding='utf-8', errors='strict')

encoding: tells Python what encoding standard we want to use:
    ascii - historical format, supports 128 characters
    latin-1 - historical format, standardized after ASCII went from 7-bit words to 8-bit bytes
    utf-8 - the one true encoding

errors: what to do if you encounter an undecodable byte in the file:
    'strict' - raise a ValueError (or a subclass, such as UnicodeDecodeError)
    'ignore' - skip the character and continue with the next
    'replace' - replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in Unicode codecs on decoding, and '?' on encoding
    'surrogateescape' - replace with private code points U+DCnn
    'xmlcharrefreplace' - replace with the appropriate XML character reference (only for encoding)
    'backslashreplace' - replace with backslashed escape sequences
    'namereplace' - replace with \N{...} escape sequences (only for encoding)
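A quick way to see these handlers in action without touching the filesystem is to decode a deliberately mis-encoded byte string; the same errors= values work with open(). A sketch:

    data = b'caf\xe9'  # 'café' encoded as Latin-1; \xe9 is not valid UTF-8 here

    print(data.decode('utf-8', errors='replace'))  # 'caf�' -- U+FFFD stands in
    print(data.decode('utf-8', errors='ignore'))   # 'caf' -- the bad byte is dropped
    print(data.decode('latin-1'))                  # 'café' -- the right table decodes it cleanly
    data.decode('utf-8', errors='strict')          # raises UnicodeDecodeError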