MACHINE REPRESENTATION OF OTHER DATA FORMATS (TEXT CHARACTERS, STRINGS).

Exercises, set 4.

1. ASCII character encoding and extensions.

The first widely used standardized binary code for alphanumeric characters (Morse's alphabet is also an alphanumeric code, though arguably not a "digital" one) was implemented in the early 1960s for teletypes (TTY, Tele Typewriters) by the Bell company. Although TTY machines had been in use for many years before (see Fig. 1), the standard introduced by Bell and ASA (American Standards Association, one of the former names of today's ANSI, American National Standards Institute) was better ordered than other telegraphic codes to support sorting of texts, and it also had some features important for devices other than TTYs. The official name of this standard is ASCII (American Standard Code for Information Interchange). The first edition was published in 1963, the most important revision in 1967 and the last update in 1986.

Fig. 1. Teletype machines used by the RAF during World War II.

Generic ASCII is based on a 7-bit code word (Morse's alphabet, by contrast, uses codes of different lengths). Having Q = 2^7 = 128 different codes we can encode:
- upper-case letters of the Latin alphabet (from A to Z): 26 different characters;
- lower-case letters of the Latin alphabet (from a to z): the next 26 different characters;
- decimal digits (from 0 to 9): the next 10 printable characters;
- the space: 1 character (not visible, but "printable");
- punctuation marks (dot, comma etc.) and a few miscellaneous symbols (like #, $, %, &, (, ), <, > etc.): 32 different printable characters;
- some control characters (not printable), used to drive TTYs and the transmission process (like STX, Start of Text; ETX, End of Text; BEL, Bell etc.): 33 different codes (including BS, Backspace, and DEL, Delete).

The construction of the ASCII code was based on the organization of data on typical punched paper tapes, which were the most popular kind of mass storage media used together with TTY machines (see Fig. 2). The tapes used for TTYs (for off-line work) could record 7 bits of data in each row: the 4 less significant bits were located on the right side of the synchronization / drive track, and the 3 more significant bits were positioned on the left side.

Fig. 2. Typical punched paper tape storage media: one row of holes consists of 3 + 4 data bits with the synchronization / drive track between them. Tapes with 8-bit encoding (8 data holes, or 7 data holes + 1 parity check hole) were also in use (see the right-side picture).

Control codes (not printable) occupy the first 32 values (0 to 31 decimal) and the last possible 7-bit value (111 1111b), which encodes DEL (Delete). In binary representation all these values except DEL are easy to detect: their upper 3 bits are either 000b or 001b, so the two most significant bits are always zero. DEL is the only exception to this rule, but its code (all ones) is also very easy to detect.

Most of these codes have no meaning today. They were used for organizing data blocks during transmission sessions with teletypes, marking the end of a disk file in older operating systems (like CP/M) etc. Only some of them are still in use (Backspace, Delete, Escape, Line Feed and Carriage Return, for example). The NUL character has an important meaning in C programming: it is used as the marker for the end of a text string. Generic C does not have a string (or similar) data type; programmers use static or dynamic arrays of characters instead. The Escape code is often used for sending sequences of control codes to printers, terminals and other devices. These Escape sequences consist of the ESC code followed by some parameters, which can force the printer to change the font, for example. All the ASCII control codes are shown in Table 1.
7-bit ASCII codes for printable characters start from the first value that has 1 on the second bit from the left side of the 7-bit code word, i.e. at bit position 5 when positions are counted from the right side and the least significant bit has index 0. This first value is 010 0000b (20h). So the upper 3 bits of a printable character show the pattern 01x or 1xx (x means any value). We should exclude the DEL code (111 1111b) from this rule (see above). The codes with 0 on the most significant position (binary pattern 01x xxxx) are assigned to the space, decimal digits, punctuation marks and other special characters. Upper-case letters are encoded with binary patterns 10x xxxx, while lower-case letters have codes with 1 on the two most significant bits (pattern 11x xxxx). Some additional characters (like @, [, \, ], ^, _, `, {, |, }, ~) are encoded by values from these two ranges too. An important feature of ASCII is that the codes for subsequent digits (0, 1, ..., 9) and letters (A, B, ..., Z and a, b, ..., z) are ordered: the code of digit 0 has a lower value than the code of digit 1, the code of letter A has a lower value than the code of letter B, and so on. All 7-bit ASCII printable characters are shown in Table 2.

Table 1. 7-bit ASCII control codes (not printable).

Binary    Octal  Decimal  Hexadecimal  Abbreviation  Description
000 0000  000    0        00           NUL           Null character
000 0001  001    1        01           SOH           Start of Header
000 0010  002    2        02           STX           Start of Text
000 0011  003    3        03           ETX           End of Text
000 0100  004    4        04           EOT           End of Transmission
000 0101  005    5        05           ENQ           Enquiry
000 0110  006    6        06           ACK           Acknowledgment
000 0111  007    7        07           BEL           Bell
000 1000  010    8        08           BS            Backspace
000 1001  011    9        09           HT            Horizontal Tab
000 1010  012    10       0A           LF            Line feed
000 1011  013    11       0B           VT            Vertical Tab
000 1100  014    12       0C           FF            Form feed
000 1101  015    13       0D           CR            Carriage return
000 1110  016    14       0E           SO            Shift Out
000 1111  017    15       0F           SI            Shift In
001 0000  020    16       10           DLE           Data Link Escape
001 0001  021    17       11           DC1           Device Control 1 (often XON)
001 0010  022    18       12           DC2           Device Control 2
001 0011  023    19       13           DC3           Device Control 3 (often XOFF)
001 0100  024    20       14           DC4           Device Control 4
001 0101  025    21       15           NAK           Negative Acknowledgement
001 0110  026    22       16           SYN           Synchronous Idle
001 0111  027    23       17           ETB           End of Transmission Block
001 1000  030    24       18           CAN           Cancel
001 1001  031    25       19           EM            End of Medium
001 1010  032    26       1A           SUB           Substitute
001 1011  033    27       1B           ESC           Escape
001 1100  034    28       1C           FS            File Separator
001 1101  035    29       1D           GS            Group Separator
001 1110  036    30       1E           RS            Record Separator
001 1111  037    31       1F           US            Unit Separator
111 1111  177    127      7F           DEL           Delete

Table 2. 7-bit ASCII characters (printable).

Binary    Dec  Hex  Glyph    Binary    Dec  Hex  Glyph    Binary    Dec  Hex  Glyph
010 0000  32   20   SPC      100 0000  64   40   @        110 0000  96   60   `
010 0001  33   21   !        100 0001  65   41   A        110 0001  97   61   a
010 0010  34   22   "        100 0010  66   42   B        110 0010  98   62   b
010 0011  35   23   #        100 0011  67   43   C        110 0011  99   63   c
010 0100  36   24   $        100 0100  68   44   D        110 0100  100  64   d
010 0101  37   25   %        100 0101  69   45   E        110 0101  101  65   e
010 0110  38   26   &        100 0110  70   46   F        110 0110  102  66   f
010 0111  39   27   '        100 0111  71   47   G        110 0111  103  67   g
010 1000  40   28   (        100 1000  72   48   H        110 1000  104  68   h
010 1001  41   29   )        100 1001  73   49   I        110 1001  105  69   i
010 1010  42   2A   *        100 1010  74   4A   J        110 1010  106  6A   j
010 1011  43   2B   +        100 1011  75   4B   K        110 1011  107  6B   k
010 1100  44   2C   ,        100 1100  76   4C   L        110 1100  108  6C   l
010 1101  45   2D   -        100 1101  77   4D   M        110 1101  109  6D   m
010 1110  46   2E   .        100 1110  78   4E   N        110 1110  110  6E   n
010 1111  47   2F   /        100 1111  79   4F   O        110 1111  111  6F   o
011 0000  48   30   0        101 0000  80   50   P        111 0000  112  70   p
011 0001  49   31   1        101 0001  81   51   Q        111 0001  113  71   q
011 0010  50   32   2        101 0010  82   52   R        111 0010  114  72   r
011 0011  51   33   3        101 0011  83   53   S        111 0011  115  73   s
011 0100  52   34   4        101 0100  84   54   T        111 0100  116  74   t
011 0101  53   35   5        101 0101  85   55   U        111 0101  117  75   u
011 0110  54   36   6        101 0110  86   56   V        111 0110  118  76   v
011 0111  55   37   7        101 0111  87   57   W        111 0111  119  77   w
011 1000  56   38   8        101 1000  88   58   X        111 1000  120  78   x
011 1001  57   39   9        101 1001  89   59   Y        111 1001  121  79   y
011 1010  58   3A   :        101 1010  90   5A   Z        111 1010  122  7A   z
011 1011  59   3B   ;        101 1011  91   5B   [        111 1011  123  7B   {
011 1100  60   3C   <        101 1100  92   5C   \        111 1100  124  7C   |
011 1101  61   3D   =        101 1101  93   5D   ]        111 1101  125  7D   }
011 1110  62   3E   >        101 1110  94   5E   ^        111 1110  126  7E   ~
011 1111  63   3F   ?        101 1111  95   5F   _

The manufacturers of different text-oriented I/O equipment (printers first of all) soon discovered that some non-printable codes of the ASCII set, which make no sense as control codes for devices other than TTYs, could be used to print some extra characters. So, depending on the manufacturer of the particular printer (EPSON or IBM, for example), we can print additional graphic symbols using codes less than 010 0000b (20h). We can encounter the same situation with text-mode displays (in Windows console applications, for example). The number of TTY control codes utilized this way is not enough to encode different national characters, so this problem was solved in another way (see below).

An interesting experiment can be performed with a simple C program that prints all the characters for code values from 0 to 127d on the text console, extended to print the characters for 8-bit codes larger than 127d as well. The results will be different under Windows and Linux.

    // For a standard C compiler (gcc under UNIX / Linux, for example):
    // #include <stdlib.h>                  // UNIX / Linux
    // #include <stdio.h>                   // UNIX / Linux
    // For the Visual C++ 2012 compiler, Win32 console application:
    #include "stdafx.h"                     // Windows Visual C++ 2012
    #include <cstdio>                       // Windows Visual C++ 2012 (declares printf)
    #include <cstdlib>                      // Windows Visual C++ 2012
    #include <iostream>                     // Windows Visual C++ 2012
    using namespace std;                    // Windows Visual C++ 2012

    // int main(int argc, char *argv[])     // UNIX / Linux
    int _tmain(int argc, _TCHAR* argv[])    // Windows Visual C++ 2012
    {
        unsigned char ch;
        int i = 0;                          // characters printed in the current row

        printf( "Standard 7-bit ASCII code range:\n" );
        for ( ch = 0; ch <= 127; ch++ )
        {
            printf( "%02xh - %c ", ch, ch );
            i++;
            if ( i == 8 )                   // 8 characters per row
            {
                i = 0;
                printf( "\n" );
            }
        }

        // Now we want to see the characters for codes larger than 127d.
        // The loop stops at 254: with an unsigned char counter the
        // condition ch <= 255 would always be true.
        i = 0;
        printf( "\nExtended 8-bit characters:\n" );
        for ( ch = 128; ch <= 254; ch++ )
        {
            printf( "%02xh - %c ", ch, ch );
            i++;
            if ( i == 8 )
            {
                i = 0;
                printf( "\n" );
            }
        }
        printf( "\n" );

        system("pause");                    // Windows Visual C++ 2012
        return EXIT_SUCCESS;                // Windows Visual C++ 2012
        // return 0;                        // UNIX / Linux
    }

The original ("telegraphic") ASCII code is based on a 7-bit code word, but most of today's computers use the 8-bit byte as the basic (or shortest) machine word. So it is possible to extend the ASCII code using the spare most significant bit. The main reason for using an extended code is the encoding of national (not only basic Latin) characters. Several different extensions were defined to solve the problem of national characters; the most important 8-bit extensions of the ASCII code are:

- ISO/IEC 8859: a set of standards for different national alphabets, defined in the mid-1980s by ECMA (European Computer Manufacturers' Association) and then accepted by ISO. In this standard ISO 8859-1 (Latin-1) defines the Latin-based character set used in most Western European languages. ISO 8859-2 (Latin-2) defines the Latin-based code page for most Central and Eastern European languages (including Polish), ISO 8859-5 supports the Cyrillic alphabet (used in Bulgaria, Russia, Serbia and some other countries), ISO 8859-6 covers the Arabic alphabet, ISO 8859-7 supports the Greek alphabet etc. The codes for national characters start from the binary value 1010 0000 (A0h, see Table 3), while the codes for standard printable Latin characters are compatible with the 7-bit ASCII set, starting from 0010 0000b. Codes between 1000 0000b and 1001 1111b (80h to 9Fh) are used for non-printable control characters.

- Microsoft code pages: Microsoft defined a number of code pages (the term "code page" was originally introduced by IBM, see below) known as the ANSI code pages. The first of them, CP 1252, is based on an ANSI draft which became the ISO 8859-1 standard. It is built on ISO 8859-1 but uses the range of 8-bit codes between 1000 0000b and 1001 1111b (80h to 9Fh) for additional printable characters rather than for the control codes used in ISO 8859-1. Other Microsoft code pages correspond with other parts of ISO 8859 but are often modified to be closer to 1252. For example:
    1250: Central and Eastern European Latin (including Polish);
    1251: Cyrillic;
    1252: West European Latin (based on ISO 8859-1 with some extra characters);
    1253: Greek;
    1254: Turkish;
    1255: Hebrew;
    1256: Arabic;
    1257: Baltic;
    1258: Vietnamese.

- IBM PC (OEM) code pages: these code pages are (or were) most often used on PC computers under MS-DOS and similar operating systems (PC-DOS, Caldera DOS, FreeDOS etc.). They include a lot of box-drawing (semi-graphic) characters for the lines, corners and junctions of frames. Since the original IBM PC code page (number 437) was not really designed for international use, several incompatible variants emerged. Microsoft refers to these as the OEM code pages. Examples include:
    437: The original IBM PC code page;
    737: Greek;
    775: Estonian, Lithuanian and Latvian;
    850: Multilingual Latin-1 (Western European languages);
    852: Slavic Latin-2 (Central and Eastern European languages);
    855: Cyrillic;
    857: Turkish;
    858: Multilingual with euro symbol;
    860: Portuguese;
    861: Icelandic;
    862: Hebrew;
    863: French Canadian;
    865: Nordic;
    866: Cyrillic.

Table 3. Different ISO 8859 character sets. Binary Dec Hex 8859-1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 1010 0000 160 A0 "hard space" (NBSP) 1010 0001 161 A1 Ą Ħ Ą Ё Ą ก Ḃ Ą 1010 0010 162 A2 ĸ Ђ Ē ข ḃ ą 1010 0011 163 A3 Ł Ŗ Ѓ Ģ ฃ Ł 1010 0100 164 A4 Є Ī ค Ċ 1010 0101 165 A5 Ľ Ĩ Ѕ Ĩ ฅ ċ 1010 0110 166 A6 Ś Ĥ Ļ І Ķ ฆ Ḋ Š Š 1010 0111 167 A7 Ї ง 1010 1000 168 A8 Ј Ļ จ Ø Ẁ š š 1010 1001 169 A9 Š İ Š Љ Đ ฉ 1010 1010 170 AA ª Ş Ş Ē Њ ͺ ª Š ช Ŗ Ẃ ª Ș 1010 1011 171 AB «Ť Ğ Ģ Ћ «««Ŧ ซ «ḋ ««1010 1100 172 AC Ź Ĵ Ŧ Ќ Ž ฌ Ỳ Ź 1010 1101 173 AD ญ 1010 1110 174 AE Ž Ž Ў Ū ฎ ź 1010 1111 175 AF Ż Ż Џ Ŋ ฏ Æ Ÿ Ż 1011 0000 176 B0 А ฐ Ḟ 1011 0001 177 B1 ± ą ħ ą Б ± ± ± ą ฑ ± ḟ ± ± 1011 0010 178 B2 ² ² В ² ² ² ē ฒ ² Ġ ² Č 1011 0011 179 B3 ³ ł ³ ŗ Г ³ ³ ³ ģ ณ ³ ġ ³ ł 1011 0100 180 B4 Д ī ด Ṁ Ž Ž 1011 0101 181 B5 µ ľ µ ĩ Е µ µ ĩ ต µ ṁ µ 1011 0110 182 B6 ś ĥ ļ Ж Ά ķ ถ 1011 0111 183 B7 ˇ ˇ З ท Ṗ 1011 1000 184 B8 И Έ ļ ธ ø ẁ ž ž 1011 1001 185 B9 ¹ š ı š Й Ή ¹ ¹ đ น ¹ ṗ ¹ č 1011 1010 186 BA º ş ş ē К Ί º š บ ŗ ẃ º ș 1011 1011 187 BB» ť ğ ģ Л»»» ŧ ป» Ṡ»» 1011 1100 188 BC ¼ ź ĵ ŧ М Ό ¼ ¼ ž ผ ¼ ỳ Œ Œ 1011 1101 189 BD ½ ½ Ŋ Н ½ ½ ½ ฝ ½ Ẅ œ œ 1011 1110 190 BE ¾ ž ž О Ύ ¾ ¾ ū พ ¾ ẅ Ÿ Ÿ 1011 1111 191 BF ż ż ŋ П Ώ ŋ ฟ æ ṡ ż 1100 0000 192 C0 À Ŕ À Ā Р ΐ À Ā ภ Ą À À À 1100 0001 193 C1 Á Á Á Á С ء Α Á Á ม Į Á Á Á 1100 0010 194 C2 Â Â Â Â Т آ Β Â Â ย Ā Â Â Â 1100 0011 195 C3 Ã Ă Ã У أ Γ Ã Ã ร Ć Ã Ã Ă 1100 0100 196 C4 Ä Ä Ä Ä Ф ؤ Δ Ä Ä ฤ Ä Ä Ä Ä 1100 0101 197 C5 Å Ĺ Ċ Å Х إ Ε Å Å ล Å Å Å Ć 1100 0110 198 C6 Æ Ć Ĉ Æ Ц ئ Ζ Æ Æ ฦ Ę Æ Æ Æ 1100 0111 199 C7 Ç Ç Ç Į Ч ا Η Ç Į ว Ē Ç Ç Ç 1100 1000 200 C8 È Č È Č Ш ب Θ È Č ศ Č È È È 1100 1001 201 C9 É É É É Щ ة Ι É É ษ É É É É 1100 1010 202 CA Ê Ę Ê Ę Ъ ت Κ Ê Ę ส Ź Ê Ê Ê 1100 1011 203 CB Ë Ë Ë Ë Ы ث Λ Ë Ë ห Ė Ë Ë Ë 1100 1100 204 CC Ì Ě Ì Ė Ь ج Μ Ì Ė ฬ Ģ Ì Ì Ì 1100 1101 205 CD Í Í Í Í Э ح Ν Í Í อ Ķ Í Í Í 1100 1110 206 CE Î Î Î Î Ю خ Ξ Î Î ฮ Ī Î Î Î 1100 1111 207 CF Ï Ď Ï Ī Я د Ο Ï Ï ฯ Ļ Ï Ï Ï 1101 0000 208 D0 Ð Đ Đ а ذ Π Ğ Ð ะ Š Ŵ Ð Đ 1101 0001 209 D1 Ñ Ń Ñ Ņ б ر Ρ Ñ Ņ Ń Ñ Ñ 
Ń 1101 0010 210 D2 Ò Ň Ò Ō в ز Ò Ō า Ņ Ò Ò Ò 1101 0011 211 D3 Ó Ó Ó Ķ г س Σ Ó Ó า Ó Ó Ó Ó 1101 0100 212 D4 Ô Ô Ô Ô д ش Τ Ô Ô Ō Ô Ô Ô 1101 0101 213 D5 Õ Ő Ġ Õ е ص Υ Õ Õ Õ Õ Õ Ő 1101 0110 214 D6 Ö Ö Ö Ö ж ض Φ Ö Ö Ö Ö Ö Ö 1101 0111 215 D7 з ط Χ Ũ Ṫ Ś 1101 1000 216 D8 Ø Ř Ĝ Ø и ظ Ψ Ø Ø Ų Ø Ø Ű 1101 1001 217 D9 Ù Ů Ù Ų й ع Ω Ù Ų Ł Ù Ù Ù 1101 1010 218 DA Ú Ú Ú Ú к غ Ϊ Ú Ú Ś Ú Ú Ú 1101 1011 219 DB Û Ű Û Û л Ϋ Û Û Ū Û Û Û 1101 1100 220 DC Ü Ü Ü Ü м ά Ü Ü Ü Ü Ü Ü 1101 1101 221 DD Ý Ý Ŭ Ũ н έ İ Ý Ż Ý Ý Ę 1101 1110 222 DE Þ Ţ Ŝ Ū о ή Ş Þ Ž Ŷ Þ Ț 1101 1111 223 DF ß ß ß ß п ί ß ß ß ß ß ß 6

1110 0000 224 E0 À ŕ à ā р ΰ א à ā เ ą à à à 1110 0001 225 E1 Á á á á с ف α ב á á แ į á á á 1110 0010 226 E2 Â â â â т ق β ג â â โ ā â â â 1110 0011 227 E3 Ã ă ã у ك γ ד ã ã ใ ć ã ã ă 1110 0100 228 E4 Ä ä ä ä ф ل δ ה ä ä ไ ä ä ä ä 1110 0101 229 E5 Å ĺ ċ å х م ε ו å å ๅ å å å ć 1110 0110 230 E6 Æ ć ĉ æ ц ن ζ ז æ æ ๆ ę æ æ æ 1110 0111 231 E7 Ç ç ç į ч ه η ח ç į ē ç ç ç 1110 1000 232 E8 È č è č ш و θ ט è č č è è è 1110 1001 233 E9 É é é é щ ى ι י é é é é é é 1110 1010 234 EA Ê ę ê ę ъ ي κ ך ê ę ź ê ê ê 1110 1011 235 EB Ë ë ë ë ы λ כ ë ë ė ë ë ë 1110 1100 236 EC Ì ě ì ė ь μ ל ì ė ģ ì ì ì 1110 1101 237 ED Í í í í э ν ם í í ķ í í í 1110 1110 238 EE Î î î î ю ξ מ î î ī î î î 1110 1111 239 EF Ï ď ï ī я ο ן ï ï ļ ï ï ï 1111 0000 240 F0 Ð đ đ ȑ π נ ğ ð ๐ š ŵ ð đ 1111 0001 241 F1 Ñ ń ñ ņ ё ρ ס ñ ņ ๑ ń ñ ñ ń 1111 0010 242 F2 Ò ň ò ō ђ ς ע ò ō ๒ ņ ò ò ò 1111 0011 243 F3 Ó ó ó ķ ѓ σ ף ó ó ๓ ó ó ó ó 1111 0100 244 F4 Ô ô ô ô є τ פ ô ô ๔ ō ô ô ô 1111 0101 245 F5 Õ ő ġ õ ѕ υ ץ õ õ ๕ õ õ õ ő 1111 0110 246 F6 Ö ö ö ö і φ צ ö ö ๖ ö ö ö ö 1111 0111 247 F7 ї χ ק ũ ๗ ṫ ś 1111 1000 248 F8 Ø ř ĝ ø ј ψ ר ø ø ๘ ų ø ø ű 1111 1001 249 F9 Ù ů ù ų љ ω ש ù ų ๙ ł ù ù ù 1111 1010 250 FA Ú ú ú ú њ ϊ ת ú ú ś ú ú ú 1111 1011 251 FB Û ű û û ћ ϋ û û ū û û û 1111 1100 252 FC Ü ü ü ü ќ ό ü ü ü ü ü ü 1111 1101 253 FD Ý ý ŭ ũ ύ LR ı ý ż ý ý ę M 1111 1110 254 FE Þ ţ ŝ ū ў ώ RL ş þ ž ŷ þ ț M 1111 1111 255 FF Ÿ џ ÿ ĸ ÿ ÿ ÿ The 8-bit code pages suffer from several problems: 1. Some code page vendors have insufficiently documented the meaning of all code point values. This decreases the reliably of handling textual data through various computer systems consistently. 2. Some vendors add extensions to some code pages to add or change certain code values. For example, byte 5Ch can represent either a back slash or a yen currency symbol, depending on the platform. 3. Multiple languages can t be handled in the same application (the code page is set on operating system level in most of cases). 
Applications may also treat text encoded in Windows-1252 as ISO-8859-1, the default character set for HTML (if no other is specified in the header of the HTML file). Fortunately, the only difference between these code pages is that Windows-1252 uses the range that ISO-8859-1 reserves for control characters for some additional printable characters. Since the control codes have no function in HTML, most web browsers tend to use Windows-1252 by default rather than ISO-8859-1.

2. The Unicode character encoding.

Unicode is an industry standard (synchronized with ISO/IEC 10646) that allows computers to represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set (UCS) standard and published in book form as The Unicode Standard, Unicode consists of:
- a set of more than 100 000 characters;
- a set of code charts for visual reference;
- an encoding methodology and a set of standard character encodings;
- an enumeration of character properties such as upper and lower case;
- a set of reference data computer files;
- a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts, such as Latin or Cyrillic).

The Unicode Consortium, the non-profit organization of different manufacturers that coordinates Unicode's development, wants to replace existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments. The UTF standards have been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern operating systems.

Unicode can be implemented by different character encodings. The most commonly used and well-known encodings are:
- UTF-8: the most popular one, a variable-length code which uses 1 byte for all standard ASCII (7-bit) Latin characters and up to 4 bytes for other characters. The Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, so Unicode characters transformed into UTF-8 can be used with much existing software without significant modifications of programs;
- UCS-2: an obsolete code which uses 2 bytes for all characters, but does not include every character in the Unicode standard;
- UTF-16: extends UCS-2, using 4 bytes to encode the characters missing from UCS-2;
- UTF-32: uses constant-length 4-byte codes for every character.

Written languages are represented by textual elements that are used to create words and sentences.
These elements may be letters such as "A" or "s", characters such as those used in Japanese Hiragana to represent syllables, or ideographs such as those used in Chinese to represent full words or concepts. The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l". To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. In most cases code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.

A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard and can be stored as the one-byte value 41h in the computer's memory if the system uses UTF-8 encoding. Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A". U+0104 is assigned the character name "LATIN CAPITAL LETTER A WITH OGONEK" (Polish "Ą"; "ogonek" means "tail"). These Unicode names are identical to the ISO/IEC 10646 names for the same characters.

The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order (alphabetic order, for example), the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code points, while the CJK (Chinese-Japanese-Korean) code blocks contain many thousands of code points. To get a clearer idea of how different code blocks are arranged, we can take a look at the webpage http://www.utf8-chartable.de/unicode-utf8-table.pl which presents the UTF-8 encoding.

Computer text handling involves processing and encoding. For example, when a word processor user is typing text at a keyboard, the software receives a message that the user pressed a key combination for "T", which it encodes as U+0054. The word processor stores this number in memory (using 1 or more bytes, in most cases according to the encoding type used by the operating system), and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a part of the word processor itself, uses the number as an index to find an image of a "T" (a glyph), which it draws on the monitor screen.

The difference between identifying a code point and rendering it on screen or paper is basic for understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE". The mark made on screen or paper (a glyph) is a visual representation of the character. The Unicode Standard does not define glyph images.
The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape or style of on-screen characters.

Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences; they are made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â". The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored is not important as long as the marks do not interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive non-spacing characters are applied to a base character, and when the order is significant.

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308. The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin-1, which includes many precomposed characters such as "â" (defined as "LATIN SMALL LETTER A WITH CIRCUMFLEX"), "ü", "ñ" etc.

Exercises:

1. The text "Year 1984" encoded in ASCII and stored in the computer's memory has the following representation: 59h, 65h, 61h, 72h, 20h, 31h, 39h, 38h, 34h. Without looking up each character in the code table(s), try to encode the text "YEAR 2008".

2. Using the webpage http://www.utf8-chartable.de/unicode-utf8-table.pl try to encode your name in Unicode (UTF-8): write it down in your national script and write down the sequence of code points.

3. Using C code similar to (or the same as) that shown in this document, try to check whether your computer uses ISO 8859, Windows (ANSI) or Unicode character encoding as the default for the operating system.