Representing Characters, Strings and Text

Similar documents
UTF and Turkish. İstinye University. Representing Text

Representing Characters and Text

Chapter 4: Computer Codes. In this chapter you will learn about:

IT101. Characters: from ASCII to Unicode

Digital Logic. The Binary System is a way of writing numbers using only the digits 0 and 1. This is the method used by the (digital) computer.

Administrative Notes February 9, 2017

Numbers and Representations

COM Text User Manual

Programming Concepts and Skills. ASCII and other codes

umber Systems bit nibble byte word binary decimal

Chapter 11 : Computer Science. Information Representation. Class XI ( As per CBSE Board) New Syllabus

Beyond Base 10: Non-decimal Based Number Systems

1.1. INTRODUCTION 1.2. NUMBER SYSTEMS

UNIT 7A Data Representation: Numbers and Text. Digital Data

2011 Martin v. Löwis. Data-centric XML. Character Sets

2007 Martin v. Löwis. Data-centric XML. Character Sets

Can R Speak Your Language?

CMPS 10 Introduction to Computer Science Lecture Notes

LING 388: Computers and Language. Lecture 5

Number codes nibble byte word double word

Beyond Base 10: Non-decimal Based Number Systems

COSC 243 (Computer Architecture)

Character Encodings. Fabian M. Suchanek

Chapter 7. Representing Information Digitally

CS144: Content Encoding

Princeton University. Computer Science 217: Introduction to Programming Systems. Data Types in C

1.1 Information representation

Digital Fundamentals

2a. Codes and number systems (continued) How to get the binary representation of an integer: special case of application of the inverse Horner scheme

Chapter 3 DATA REPRESENTATION

Memory Addressing, Binary, and Hexadecimal Review

St. Benedict s High School. Computing Science. Software Design & Development. (Part 2 Computer Architecture) National 5

Google Search Appliance

Homework 1 graded and returned in class today. Solutions posted online. Request regrades by next class period. Question 10 treated as extra credit

Representing Information. Bit Juggling. - Representing information using bits - Number representations. - Some other bits - Chapters 1 and 2.3,2.

Friendly Fonts for your Design

Chapter 3. Information Representation

Review of HTML. Ch. 1

Chapter 9: Data Transmission

The Unicode Standard Version 11.0 Core Specification

Computer Organization and Assembly Language. Lab Session 4

Intermediate Programming & Design (C++) Notation

Numeral Systems (Part II)

Announcement. (CSC-3501) Lecture 3 (22 Jan 2008) Today, 1 st homework will be uploaded at our class website. Seung-Jong Park (Jay)

Digital Computers and Machine Representation of Data

Number System (Different Ways To Say How Many) Fall 2016

[301] Bits and Memory. Tyler Caraza-Harter

Using sticks to count was a great idea for its time. And using symbols instead of real sticks was much better.

Computer Science Applications to Cultural Heritage. Introduction to computer systems

CS 33. Introduction to C. Part 5. CS33 Intro to Computer Systems V 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

M1 Computers and Data

Source coding and compression

Number Systems Prof. Indranil Sen Gupta Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Number Representation

LBSC 690: Information Technology Lecture 05 Structured data and databases

Stream Ciphers. Koç ( ucsb ccs 130h explore crypto fall / 13

Binary Numbers. The Basics. Base 10 Number. What is a Number? = Binary Number Example. Binary Number Example

Computer Organization

Tex with Unicode Characters

CPSC 301: Computing in the Life Sciences Lecture Notes 16: Data Representation

IT 1204 Section 2.0. Data Representation and Arithmetic. 2009, University of Colombo School of Computing 1

DATA REPRESENTATION. By- Neha Tyagi PGT CS KV 5 Jaipur II Shift, Jaipur Region. Based on CBSE curriculum Class 11. Neha Tyagi, KV 5 Jaipur II Shift

LING 408/508: Computational Techniques for Linguists. Lecture 3

Digital Fundamentals

Number Systems II MA1S1. Tristan McLoughlin. November 30, 2013

REPRESENTING INFORMATION:

Lecture 1: What is a computer?

UNIT 2 NUMBER SYSTEM AND PROGRAMMING LANGUAGES

Using non-latin alphabets in Blaise

CMSC 313 COMPUTER ORGANIZATION & ASSEMBLY LANGUAGE PROGRAMMING LECTURE 02, SPRING 2013

ECE260: Fundamentals of Computer Engineering

Computer Science 1001.py. Lecture 19a: Generators continued; Characters and Text Representation: Ascii and Unicode

Final Labs and Tutors

CS-201 Introduction to Programming with Java

Note 8. Internationalization

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Coordination! As complex as Format Integration!

Sources of Evidence. CSF: Forensics Cyber-Security. Part I. Foundations of Digital Forensics. Fall 2015 Nuno Santos

MACHINE LEVEL REPRESENTATION OF DATA

Extended Character Sets for UCAS Systems

Data Types Literals, Variables & Constants

SAPGUI for Windows - I18N User s Guide

CS 261 Fall Binary Information (convert to hex) Mike Lam, Professor

Binary Representation. Jerry Cain CS 106AJ October 29, 2018 slides courtesy of Eric Roberts

CS & IT Conversions. Magnitude 10,000 1,

Using Unicode with MIME

Binary. Hexadecimal BINARY CODED DECIMAL

Computer Science 1001.py. Lecture 19, part B: Characters and Text Representation: Ascii and Unicode

CHAPTER 2 - DIGITAL DATA REPRESENTATION AND NUMBERING SYSTEMS

DATA PROCESSING Scheme of work for 2016 SS Three Extension Class

Lecture 25: Internationalization. UI Hall of Fame or Shame? Today s Topics. Internationalization Design challenges Implementation techniques

INTERNATIONAL STANDARD. This is a preview - click here to buy the full publication

Coding Theory. Networks and Embedded Software. Digital Circuits. by Wolfgang Neff

BINARY SYSTEM. Binary system is used in digital systems because it is:

Technical concepts. Some basics of computers today. Comp 399

Course Schedule. CS 221 Computer Architecture. Week 3: Plan. I. Hexadecimals and Character Representations. Hexadecimal Representation

EMu Documentation. Unicode in EMu 5.0. Document Version 1. EMu 5.0

Information Science 1

Practical character sets

Numeral Systems. -Numeral System -Positional systems -Decimal -Binary -Octal. Subjects:

You 2 Software

Transcription:

Çetin Kaya Koç http://koclab.cs.ucsb.edu/teaching/cs192 koc@cs.ucsb.edu Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 1 / 19

Representing and Processing Text Representation of text predates the use of computers for text processing Text representation was needed for communication equipment One particular commonly used communication equipment was teleprinter (also called as teletypewriter, Teletype or TTY) It was an electromechanical typewriter that can be used to send and receive typed messages over various types of communications channels Teleprinters were adapted to provide a user interface to early mainframe computers and minicomputers, sending typed data to the computer and printing the response Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 2 / 19

Teleprinters and ASCII Most modern character-encoding schemes are based on ASCII, which was first developed and used for teleprinters ASCII is developed by American Standards Association in early 1960s The first commercial use was as a 7-bit teleprinter code C etin Kaya Koc http://koclab.cs.ucsb.edu Fall 2016 3 / 19

American Standard Code for Information Interchange ASCII was originally based on the English alphabet ASCII encodes 128 characters into 7-bit binary integers 33 of these are non-printing control characters (many now obsolete) A is encoded as (100 0001) 2 = (41) 16 = (65) 10 + is encoded as 010 1011) 2 = (2B) 16 = (43) 10 Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 4 / 19

American Standard Code for Information Interchange The first 32 (decimal: 0 to 31) and the last code (decimal: 127) are control codes For example, character 10 (0A) represents the line feed function (which causes the printer to advance its paper), character 8 (08) represents backspace, and character 127 (FF) represents Delete Normally a 7-bit code is placed inside an 8-bit (1-byte) binary number, keeping the most significant (leftmost) bit 0 Therefore, a simple text file containing the word hello will have 5 bytes for the 5 letters There may also be 1 byte space for the LF (line feed) character at the end of each line Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 5 / 19

Extensions of ASCII How do you represent other characters not found in English? If we limit our attention to Western European languages, we still need to be able to represent letters with accents For example, in Spanish, French and German, we have: ç é ä Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 6 / 19

Extensions of ASCII Since we have 1 byte (8 bits), we have room for 256 characters This gives us 8-bit ASCII, codes from 00 (hex) to FF (hex) Unfortunately, everyone had their idea of what is needed, and many standards bodies developed different variants of ASCII Different operating systems, game consoles, and computers have designed their own ASCII extensions One thing good however is that all of these variants included 7-bit ASCII (sometimes called US-ASCII) as their subset Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 7 / 19

Extensions of ASCII Finally, 8-bit codes were standardized by International Organization for Standardization The standard ISO/IEC 8859-1:1998 defines these codes It covers an almost complete list of Western European languages It supports an extended list of languages, but missing a few letters Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 8 / 19

Further Extension of ASCII However, there are many more languages and scripts in the world Consider: Arabic, Armenian, Hebrew, Chinese, Japanese, Korean We need to create a much more comprehensive encoding system Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 9 / 19

Unicode Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world s writing systems Unicode Goal: One encoding for all scripts of the world! Unicode contains a repertoire of more than 110,000 characters covering 123 scripts and multiple symbol sets Unicode covers almost all scripts in current use today Unicode defines 1,114,112 code points, in the range 0 to 10FFFF Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 10 / 19

Unicode Unicode can be implemented by different character encodings The most commonly used encoding is UTF-8 UTF stands for Unicode Transformation Format UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters Of more than a million code points, about 100,000 are assigned Most assignments are in the first 65,536 code points Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 11 / 19

Unicode Properties Every letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639 This magic number is called a code point The U+ means Unicode and the numbers are hexadecimal U+0639 is the Arabic letter Ain The English letter A would be U+0041. There is no real limit on the number of letters that Unicode can define and in fact they go beyond 65,536 Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 12 / 19

Unicode Examples The string Hello in Unicode corresponds to five code points: U+0048 U+0065 U+006C U+006C U+006F The encoding scheme determines how these bytes are to be stored As we said, the most commonly used encoding is UTF-8 English text looks exactly the same in UTF-8 as it does in ASCII Specifically, Hello will be stored as 48 65 6C 6C 6F, which is the same as it is stored in ASCII On the other hand, Arabic, Armenian or other letters will be represented according to their (mostly 2-byte) Unicode definitions, found in http://www.unicode.org/charts Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 13 / 19

Unicode 7-bit ASCII in Python In Python, all strings are stored in UTF-8 Unicode 7-bit ASCII (US-ASCII) Example: Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 14 / 19

Unicode 8-bit ASCII in Python Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 15 / 19

Unicode Arabic Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 16 / 19

Unicode Armenian Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 17 / 19

Unicode CJK - Chinese Japanese Korean Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 18 / 19

Unicode CJK - Chinese Japanese Korean Çetin Kaya Koç http://koclab.cs.ucsb.edu Fall 2016 19 / 19