Slide Set 3 for ENCM 369 Winter 2018 Section 01
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
January 2018

ENCM 369 Winter 2018 Section 01, Slide Set 3

Contents

ASCII and Unicode
Bytes Within Memory Words in MIPS
Byte loads and stores in MIPS: lb, lbu, and sb
A Complex Stack Frame Example
Logical Instructions

Review of the ASCII character set

Text in English can be represented using ASCII codes, which are integers in the range from 0 up to and including 127.

Example ASCII codes...

    character   ASCII code
    '\0'        0
    '\n'        10
    '!'         33
    'A'         65
    'a'         97
    '3'         51

How much memory for one ASCII code?

Smallest ASCII code in base two: 0000 0000
Largest ASCII code in base two: 0111 1111

So, a single byte (8 bits) is more than large enough to hold any possible ASCII code.

Typically, ASCII character strings are stored in memory in arrays of bytes and in file systems as sequences of bytes.
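The point about ASCII codes fitting in a byte can be sketched in C. This is only an illustration: the helper names `is_ascii_code` and `bytes_for_ascii_string` are made up here and do not come from the course materials.

```c
#include <stdbool.h>

/* Hypothetical helper: true when code is in the ASCII range 0..127,
   that is, when every bit above bit 6 is zero. */
bool is_ascii_code(unsigned int code)
{
    return (code & ~0x7Fu) == 0;
}

/* Hypothetical helper: bytes needed to store an ASCII C string,
   one byte per character plus one byte for the '\0' terminator. */
unsigned int bytes_for_ascii_string(const char *s)
{
    unsigned int n = 0;
    while (s[n] != '\0')
        n++;
    return n + 1;
}
```

For example, `bytes_for_ascii_string("hello")` counts five characters plus the terminator.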

Unicode: Good to know about but not an ENCM 369 topic

Text in most human languages requires characters outside the ASCII character set. (Think of text in Greek, Russian, Arabic, South Asian and East Asian languages, and so on....)

Unicode is a system that attempts to represent all of the characters of most of the world's written languages, by, roughly speaking, assigning a unique integer (a code point) to each character (and assigning more unique integers to mostly regrettable emoji).

At present there are roughly 120,000 different code points in Unicode. $2^{16} = 65{,}536$, which is less than 120,000, so it's impossible to represent a sequence of code points using exactly two 8-bit bytes per code point.

UTF-32, UTF-16, and UTF-8

32 bits are more than enough to represent any possible Unicode code point. UTF-32 represents each code point as a 32-bit unsigned number. This is simple but wastes space.

UTF-16 is used by Java, and is therefore important. In UTF-16 some characters are represented using single 16-bit chunks, while others are represented as pairs of 16-bit chunks.

UTF-8 is in wide use. Some code points need only one byte in UTF-8, some need two bytes, some need three bytes, and, as of now, the rest need four.

UTF-8 has two big advantages: (1) it's reasonably space-efficient, and (2) as a result of careful design, any sequence of characters encoded in ASCII is also a UTF-8 encoding of the same sequence of characters.
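As a rough illustration of the "one, two, three, or four bytes" remark, here is a C sketch that reports how many bytes UTF-8 uses for a given code point. The range boundaries come from the UTF-8 definition, not from these slides, and the function name `utf8_length` is invented for this example. It does not check for surrogate code points.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 uses for code point cp,
   or 0 if cp is above the largest Unicode code point, 0x10FFFF. */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80)       return 1;   /* the ASCII range: same bytes as ASCII */
    if (cp < 0x800)      return 2;
    if (cp < 0x10000)    return 3;
    if (cp <= 0x10FFFF)  return 4;
    return 0;
}
```

The first case is what makes every ASCII string also a valid UTF-8 string.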

Errors on page 322 of Harris and Harris

Our textbook says this:

    "Other programming languages, such as Java, use different character encodings, most notably Unicode. Unicode uses 16 bits to represent each character, so it supports accents, umlauts, and Asian languages."

That is disappointingly imprecise and inaccurate, especially in a textbook that is very clear and correct in most respects.

ASCII is the character set of ENCM 369

Unicode is interesting and important, but working with encodings such as UTF-16 and UTF-8 is complicated! (It's not really, really hard; it just requires committing time to building an understanding of many, many details.)

To keep things as simple as possible in ENCM 369, we'll assume that character strings are represented in ASCII, one 8-bit byte per character.

Bytes Within Memory Words in MIPS

We've already seen that each memory word can also be used as a group of four bytes with consecutive addresses.

For example, consider the two words with addresses 0x1001 0000 and 0x1001 0004. What are the addresses of the bytes in these words? What would be the maximum length of a C character string stored in those bytes?
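One way to check answers to the questions above is a small C sketch that maps a byte address to the address of the word that contains it. The function names are illustrative only. The two words cover the eight byte addresses 0x10010000 through 0x10010007, so they could hold a C string of at most 7 characters plus the '\0' terminator.

```c
#include <stdint.h>

/* Address of the 4-byte word containing the byte at addr:
   clear bits 1 and 0 of the address. */
uint32_t word_address(uint32_t addr)
{
    return addr & ~(uint32_t)3;
}

/* Offset (0, 1, 2, or 3) of the byte within its word:
   keep only bits 1 and 0 of the address. */
uint32_t byte_offset(uint32_t addr)
{
    return addr & 3u;
}
```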

Preview of MIPS sb, lb, and lbu instructions

The MIPS sb instruction writes a single byte within a 4-byte memory word, leaving the other 3 bytes unchanged.

The MIPS lb and lbu instructions both read a single byte from within a 4-byte memory word. (We'll look at the difference between lb and lbu a bit later on.)

Before studying these instructions in detail, let's look at the relationship between a byte in memory and the memory word that contains the byte.

Bit numbering within an n-bit pattern

Here is the most usual and most convenient way to do this: the number of the MSB (most significant bit) is n-1; the number of the LSB (least significant bit) is 0.

A general n-bit pattern is then $b_{n-1} b_{n-2} \cdots b_2 b_1 b_0$, where each $b_i$ is either 0 or 1.

One reason why this is convenient is that it gives a natural, simple formula for the unsigned integer the bit pattern represents:

$b_{n-1} 2^{n-1} + b_{n-2} 2^{n-2} + \cdots + b_2 2^2 + b_1 2^1 + b_0 2^0$.

(The formula for the signed two's-complement integer the bit pattern represents is almost as simple, as we'll see later in the course.)
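The unsigned-value formula can be written directly as a loop. The following C sketch assumes the bits are supplied in an array indexed by bit number (so `bits[0]` is the LSB) and that n is at most 32; the function name `unsigned_value` is made up for illustration.

```c
#include <stdint.h>

/* Sum of b_i * 2^i for i = 0 .. n-1, where bits[i] holds b_i (0 or 1).
   Assumes 1 <= n <= 32. */
uint32_t unsigned_value(const int bits[], int n)
{
    uint32_t value = 0;
    for (int i = 0; i < n; i++)
        value += (uint32_t)bits[i] << i;   /* b_i times 2 to the power i */
    return value;
}
```

With the 8-bit pattern 01000001 (bits 6 and 0 set), the sum is 64 + 1 = 65.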

Bit numbering: examples

Machine code for MIPS instruction addi $17, $16, 4...

    bits:    31-26    25-21    20-16    15-0
    value:   001000   10000    10001    0000000000000100

A byte containing the ASCII code for 'A', which is 65 (base ten) = 0x41 = 01000001 (base two).

    bit:     7  6  5  4  3  2  1  0
    value:   0  1  0  0  0  0  0  1
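With bit n-1 as the MSB and bit 0 as the LSB, pulling a field out of a word is a shift followed by a mask. This C sketch applies that to the addi example; the 32-bit machine code for addi $17, $16, 4 is 0x22110004. The function name `bit_field` is hypothetical.

```c
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of word, using the numbering in which
   bit 31 is the MSB and bit 0 is the LSB. Assumes 31 >= hi >= lo >= 0. */
uint32_t bit_field(uint32_t word, int hi, int lo)
{
    uint32_t width = (uint32_t)(hi - lo + 1);
    /* Shifting a 32-bit value by 32 is undefined in C, so handle it apart. */
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word >> lo) & mask;
}
```

For the addi word, bits 31-26 give the opcode 8, bits 25-21 give 16, bits 20-16 give 17, and bits 15-0 give the constant 4.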

Bit numbering, continued

What you will always see in this course, in lectures, tutorials, labs, and the textbook:
    Bit n-1 is the MSB; bit 0 is the LSB.

What you will see in most current real world computer system documentation:
    Bit n-1 is the MSB; bit 0 is the LSB. (Same as in this course.)

Alternate schemes you might encounter:
    Bit 0 is the MSB; bit n-1 is the LSB.
    Bit n is the MSB; bit 1 is the LSB.
    Bit 1 is the MSB; bit n is the LSB.

Little-endian organization of bytes within words

Little-endian is the name given to this way of addressing bytes within a word:

    numbers for bits within the word:   31...24   23...16   15...8   7...0
    numbers for bits within bytes:      7...0     7...0     7...0    7...0
    address offsets of bytes:           +3        +2        +1       +0

MARS has little-endian memory organization. So do some very important architectures, such as x86 and x86-64.

There is another widely-used organization called big-endian; we'll study that much later in the course.
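The little-endian rule (the byte at address offset +k holds bits 8k+7 down to 8k of the word) can be expressed as a shift. A minimal C sketch, with a made-up function name:

```c
#include <stdint.h>

/* Byte found at address offset 0..3 within a word under little-endian
   organization: offset +0 holds bits 7-0, offset +3 holds bits 31-24. */
uint8_t byte_at_offset(uint32_t word, unsigned int offset)
{
    return (uint8_t)(word >> (8u * offset));
}
```

For the word 0x12345678, offset +0 holds 0x78 and offset +3 holds 0x12.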

Bytes within words in MARS: a demonstration

Use the editor to set up two strings of length 3. (So, 4 bytes per string.)

Use [toolbar icon not reproduced] to assemble the code. Data Segment display is... [screenshot not reproduced]

Use [control not reproduced] to show the actual characters... [screenshot not reproduced]

Example: External arrays of bytes and words

Two C arrays defined outside of function definitions...

    char foo[] = "hello";
    int bar[] = { -10, 20, -30 };

One way to write MARS A.L. to set up these two arrays...

            .data
            .globl  foo
    foo:    .byte   'h', 'e', 'l', 'l', 'o', '\0'
            .globl  bar
    bar:    .word   -10, 20, -30

Note: .byte was used just to show its similarity to .word. It's usually more convenient to use .asciiz for strings.

Layout of foo and bar within the data segment

Fragment of data segment used for foo and bar (higher addresses at the top)...

    |               -30               |   bar[2]
    |                20               |   bar[1]
    |               -10               |   bar[0]
    | unused | unused |  '\0'  |  'o' |   foo[4] is the rightmost byte
    |  'l'   |  'l'   |  'e'   |  'h' |   foo[0] is the rightmost byte

Why are these two bytes not used?

Access to elements of bar will be with lw and sw instructions, but access to elements of foo will use lb (or lbu) and sb.
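A hint for the question about the unused bytes: words accessed with lw and sw must start at addresses that are multiples of 4, so the assembler places bar at the next such address after the 6 bytes of foo. The C sketch below shows the rounding; the function name and the starting address 0x10010000 for foo are assumptions made for illustration.

```c
#include <stdint.h>

/* Round addr up to the next multiple of 4 (unchanged if already a multiple). */
uint32_t align_to_word(uint32_t addr)
{
    return (addr + 3u) & ~(uint32_t)3;
}
```

If foo starts at 0x10010000 and occupies 6 bytes, the first free byte is at 0x10010006, and rounding up gives 0x10010008 for bar, leaving 2 bytes unused.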

Contents of foo and bar as 1's and 0's instead of symbols like 'h' and -10

Fragment of data segment used for foo and bar (higher addresses at the top)...

    11111111 11111111 11111111 11100010   bar[2]
    00000000 00000000 00000000 00010100   bar[1]
    11111111 11111111 11111111 11110110   bar[0]
    00000000 00000000 00000000 01101111   foo[4] is the rightmost byte
    01101100 01101100 01100101 01101000   foo[0] is the rightmost byte

Of course, when the program runs, the 1's and 0's are high and low voltages at various nodes in a memory circuit.

Byte loads and stores in MIPS: lb, lbu, and sb instructions

lb and lbu: Both copy a byte from memory into bits 7-0 of a GPR. There is a difference in what happens to bits 31-8 of the GPR; let's illustrate that with a picture.

sb copies bits 7-0 from a GPR to a memory byte. Bits 31-8 of the GPR are ignored.

lb and lbu: examples

Suppose that
    $s0 contains 0x1001 0009;
    the value of the byte at address 0x1001 0009 is 0x99;
    the value of the byte at address 0x1001 000a is 0x7e.

What values will $t0-$t3 get?

    lb   $t0, 0($s0)
    lbu  $t1, 0($s0)
    lb   $t2, 1($s0)
    lbu  $t3, 1($s0)
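To check answers to the example above, here is a C sketch of what each instruction puts in the destination register, given the byte read from memory. The function names are invented; they model only the register result, not the memory access itself.

```c
#include <stdint.h>

/* Model of lb: bits 7-0 come from the byte, and bits 31-8 are all
   copies of bit 7 of the byte (sign-extension). */
uint32_t lb_result(uint8_t byte)
{
    uint32_t value = byte;
    if (value & 0x80u)
        value |= 0xFFFFFF00u;
    return value;
}

/* Model of lbu: bits 7-0 come from the byte, and bits 31-8 are all
   zero (zero-extension). */
uint32_t lbu_result(uint8_t byte)
{
    return (uint32_t)byte;
}
```

The two functions differ only when bit 7 of the byte is 1, as it is for 0x99 but not for 0x7e.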

sb: examples

Suppose that
    $s1 contains 0x1001 0020;
    $t0 contains 0x1234 5678.

What will these instructions do?

    sb  $t0, ($s1)
    sb  $t0, 1($s1)

Which to use: lb or lbu?

If you know that the byte being read is an ASCII code, it doesn't matter: bit 7 of the byte will be 0, so lb and lbu have the same effect.

An example on page 324 of the textbook uses lb, but would work equally well if lbu were used instead.

Examples in ENCM 369 lectures, labs, and tutorials will use lbu, to be consistent with examples used in the course over the last few years.

Why do lb and lbu both exist? (part 1)

Sometimes arrays of bytes are used not for character codes but for collections of integers with small magnitudes. We're not going to study that in detail right now.

The difference between lb and lbu has to do with rules for converting 8-bit integers into 32-bit integers with either sign-extension for signed numbers or zero-extension for unsigned numbers.

Why do lb and lbu both exist? (part 2)

lb does the right thing when an array of bytes is used as a collection of signed char numbers with values from the set {-128, -127, ..., -1, 0, 1, ..., 126, 127}.

lbu does the right thing when an array of bytes is used as a collection of unsigned char numbers with values from the set {0, 1, 2, ..., 253, 254, 255}.

For more about signed and unsigned number systems, see textbook Section 1.4 (which was covered early in ENEL 353 in Fall 2017) and ENCM 369 lectures in future weeks.

Byte access programming example

    void my_strcpy(char *dest, const char *src)
    {
        while (*src != '\0') {
            *dest = *src;
            dest++;
            src++;
        }
        *dest = '\0';
    }

This is like the strcpy function in the standard C library, except that my_strcpy does not return a value.

Let's write a MIPS A.L. translation for my_strcpy.

Mixing word and byte accesses to memory

Usually, data written as bytes is later read back as bytes; data written as words is later read back as words. However, studying code that mixes access types helps to check whether you understand exactly how byte addressing works.

Suppose that $s0 contains 0x1001 0000. What will $t2 contain after these instructions are executed...?

    addi  $t0, $zero, 0xab
    addi  $t1, $zero, 0xcd
    sw    $zero, 4($s0)
    sb    $t0, 7($s0)
    sb    $t1, 5($s0)
    lw    $t2, 4($s0)
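One way to check an answer to the question above is to model a few bytes of little-endian memory in C. This is a toy sketch, not how MARS is implemented: the array index stands for the byte offset from the address in $s0, and the function names are made up.

```c
#include <stdint.h>

/* Like sb: copy bits 7-0 of reg into one byte; bits 31-8 are ignored. */
void store_byte(uint8_t mem[], uint32_t offset, uint32_t reg)
{
    mem[offset] = (uint8_t)(reg & 0xFFu);
}

/* Like sw on a little-endian machine: bits 7-0 go to the lowest address. */
void store_word(uint8_t mem[], uint32_t offset, uint32_t reg)
{
    for (uint32_t i = 0; i < 4; i++)
        mem[offset + i] = (uint8_t)(reg >> (8u * i));
}

/* Like lw on a little-endian machine: the byte at offset + i supplies
   bits 8i+7 down to 8i of the result. */
uint32_t load_word(const uint8_t mem[], uint32_t offset)
{
    uint32_t value = 0;
    for (uint32_t i = 0; i < 4; i++)
        value |= (uint32_t)mem[offset + i] << (8u * i);
    return value;
}
```

Replaying the instruction sequence with an 8-byte array (store_word at offset 4, store_byte at offsets 7 and 5, then load_word at offset 4) puts 0xab in bits 31-24 and 0xcd in bits 15-8 of the loaded word.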

A complex stack frame example

    int g(char *p, int n);
    int h(int *q, int j, int k);

    int f(void)
    {
        int a, b;
        char x[5];
        int y[4];
        a = g(x, 5);
        b = h(y, 4, a);
        // MORE CODE: uses a, b, &b,
        // x, y, but NEVER &a.
        return b;
    }

Complex stack frame example: Questions

How to allocate a, b, x, and y?
What are the stack frame needs for f?
What will be the layout of the stack frame?
What will be the A.L. code for a = g(x, 5);?
What will be the A.L. code for b = h(y, 4, a);?

Attention: The lecture will skip writing the prologue and epilogue for f. Make sure you know what code would be needed!

Logical Instructions

Textbook Section 6.4.1

sll: shift left, logical (already seen)
srl: shift right, logical
or, ori: bitwise OR ("bitwise" means "operate on multiple pairs of input bits in parallel")
lui: copy 16-bit constant into bits 31-16 of GPR
and, andi: bitwise AND
nor: bitwise NOR
xor, xori: bitwise XOR

srl: shift right, logical

srl is like sll, but shifts right instead of left.

Example: Suppose $t0 = [24 zeros]_0101_0111. Then what does srl $t1, $t0, 3 put in $t1?
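In C, the `>>` operator applied to an unsigned 32-bit value behaves like srl: zeros are shifted in at the left. A minimal sketch, using an invented function name:

```c
#include <stdint.h>

/* Model of srl: shift value right by shamt bits, filling with zeros.
   The mask reflects the 5-bit shift-amount field in the instruction. */
uint32_t srl(uint32_t value, unsigned int shamt)
{
    return value >> (shamt & 31u);
}
```

For the example above, the low byte 0101 0111 shifted right by 3 becomes 0000 1010; the three bits shifted out on the right are lost.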

or, ori

OR truth table for one pair of input bits:

    x   y   OR(x,y)
    0   0   0
    0   1   1
    1   0   1
    1   1   1

MIPS or and ori instructions do 32 simultaneous OR operations on 32 pairs of bits.

Logical instructions: or example

    or  $t2, $t0, $t1

Suppose $t0 contains 1101_[24 zeros]_0011 and $t1 contains 0001_[24 zeros]_0101. The result, which goes into $t2, is...?

Logical instructions: or vs. add

Attention: bitwise OR is not addition! The OR of 1 and 1 is 1. The arithmetic SUM of 1 and 1 is 0, with a CARRY of 1.

Previous example, with or changed to add...

    $t0 contents...           1101_[24 zeros]_0011
    $t1 contents...           0001_[24 zeros]_0101
    result, goes to $t2...    1110_[24 zeros]_1000

The add result is not the same as the or result!
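The difference between OR and addition is easy to confirm in C, where `|` is bitwise OR and `+` on unsigned 32-bit values produces the same bit pattern as a 32-bit add. In hexadecimal, the two register values in the example are 0xD0000003 and 0x10000005. The function names are illustrative only.

```c
#include <stdint.h>

/* 32 independent OR operations, one per pair of bits; no carries. */
uint32_t or_words(uint32_t a, uint32_t b)
{
    return a | b;
}

/* 32-bit addition; carries propagate from each bit to the next. */
uint32_t add_words(uint32_t a, uint32_t b)
{
    return a + b;
}
```

The two results agree only when no bit position has a 1 in both inputs, because only then does addition generate no carries.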

Logical instructions: ori

ori: one source is a GPR, the other is a constant.

Example:

    ori  $t4, $t3, 0x895a

Suppose $t3 contains 0xab00_0034. What value does $t4 get?
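A C sketch for checking the ori example. The one detail worth noting is that the 16-bit constant is zero-extended to 32 bits before the OR, so bits 31-16 of the source register pass through unchanged. The function name is made up for illustration.

```c
#include <stdint.h>

/* Model of ori: OR the register value with the zero-extended 16-bit constant. */
uint32_t ori_result(uint32_t rs, uint16_t imm)
{
    return rs | (uint32_t)imm;
}
```

Working by hex digits for the low half: 0x0034 OR 0x895a gives 0x897e, since 0x34 OR 0x5a is 0x7e and 0x00 OR 0x89 is 0x89.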

Logical instructions: lui

lui: load upper immediate

This copies a constant into bits 31-16 of a GPR, and makes bits 15-0 of the GPR zero.

Example:

    lui  $t0, 0xf7b3

What value does $t0 get?

lui and big constants

Suppose i is an int in $s0. What would the instruction(s) be for this?

    i = 0x49b1_ae09;

The constant is too big to embed in a single instruction, so the job gets split into two instructions:

    lui  $s0, 0x49b1        # update bits 31-16
    ori  $s0, $s0, 0xae09   # update bits 15-0

What is in $s0 after lui? After ori?
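The lui/ori pair can be mirrored in C to check the intermediate and final register values. This is a sketch with invented function names; `upper_half` and `lower_half` show how a 32-bit constant splits into the two 16-bit constants used by the two instructions.

```c
#include <stdint.h>

/* Model of lui: constant goes to bits 31-16, bits 15-0 become zero. */
uint32_t lui_result(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* Model of ori: OR with the zero-extended 16-bit constant. */
uint32_t ori_result(uint32_t rs, uint16_t imm)
{
    return rs | (uint32_t)imm;
}

/* Bits 31-16 of a 32-bit constant: the operand for lui. */
uint16_t upper_half(uint32_t constant)
{
    return (uint16_t)(constant >> 16);
}

/* Bits 15-0 of a 32-bit constant: the operand for ori. */
uint16_t lower_half(uint32_t constant)
{
    return (uint16_t)(constant & 0xFFFFu);
}
```

Because lui leaves bits 15-0 zero, the following ori simply drops the lower 16 bits of the constant into place.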

lui, ori, li, la

li is a MIPS pseudoinstruction for getting a large constant into a GPR. Example:

    li   $s0, 0x49b1ae09

gets translated into what we just saw...

    lui  $s0, 0x49b1
    ori  $s0, $s0, 0xae09

The assembler handles pseudoinstructions of the form la GPR, label in a similar way.

Logical instructions: and, andi, nor, xor, xori

These are easy to understand if you understand or and ori. Read Section 6.4.1 of the textbook for details and examples.