Assembly Language Programming Introduction October 10, 2017
Motto: R7 is used by the processor as its program counter (PC). It is recommended that R7 not be used as a stack pointer. Source: PDP-11 04/34/45/55 processor handbook, Digital Equipment Corporation, 1976.
Why go down to machine language? Access to hardware registers of the processor and I/O cards. Access to instructions unknown to compilers. Precise control of code execution in places prone to deadlocks or races at the hardware level. Atomic test-and-set operations. Breaking compiler conventions for additional optimization (parameter passing, memory allocation, tail calls, i.e. tail recursion). Access to rarely used processor modes, executing code from ROM memory, etc. Hardware-restricted resources, e.g. embedded systems.
How do we pay for it? A laborious and boring (especially at first) coding process. Fantastically easy to make errors. Very hard to debug. Difficult maintenance. Basically unportable (but see compatibility). For typical programs, compiler-generated code is usually better than hand-written code.
Easing the pain Only the necessary parts should be written in assembly language. Assembly code should be encapsulated behind well-defined interfaces (procedures/functions). If possible, try to generate the assembly code automatically: macros, rewriting rules, patterns, etc.
Viewing generated code Use the -S option of the GCC compiler; adding -fverbose-asm does not hurt either. Look for places that obviously could be improved. Better yet, use a profiler first, to avoid improving rarely executed code.
Computer architecture preliminary definition An abstract description of computer structure, sufficient for a programmer coding in machine language (or a similar one). Attention: such a structure can have different hardware implementations, e.g. direct hardwired control or microprogramming.
Levels of virtual machine interface ISA: machine language (Instruction Set Architecture); ABI (adds operating system services); API (adds libraries).
Important processor properties Basic properties of computer system architecture that the programmer is interested in: basic word size, memory address space, addressing modes, instruction set, execution time (may depend on argument forms), stack organization, interrupt system (number of levels).
Organization of a simple computer The classical von Neumann model. Components: processor, memory, external devices. Buses, DMA channels. Typically programs are stored in operating memory; you cannot tell by looking at the bits whether they are program or data. Instruction = operation code + arguments. Sources of arguments: processor registers, program code, other memory cells. Coding format, fields. Memory cells. Addressing. Bit, byte, word. Memory size. Memory cycle.
Simplified processor schema General-purpose and special registers Arithmetic-logical unit (ALU) Instruction decoder Instruction counter (program counter) Interregister transfers
General purpose registers On Pentium (32-bit architecture): EAX (AX, AH/AL), EBX (BX, BH/BL), ECX (CX, CH/CL), EDX (DX, DH/DL), ESI (SI), EDI (DI), EBP (BP).
Typical special registers instruction pointer (EIP, not directly accessible), instruction register (IR, not accessible), processor status/control word (FLAGS), stack pointer (ESP), memory address and buffer registers (not accessible), segment registers (CS, DS, ES, FS, GS, SS). Additionally, the general-purpose EBP register is often used as a frame pointer on the stack.
Processor cycle Typical processor cycle (instruction cycle) = phases of instruction execution: 1 fetch: fetch from memory the instruction pointed to by the program counter; 2 decode: analyze the instruction format, determine argument modes; 3 read: fetch argument(s) from memory; 4 execute: just that; 5 write-back: store the result in a register or memory; 6 interrupt: check for interrupts.
Bus Maximum frequency is restricted by the so-called bus skew, resulting from unequal signal propagation speeds on parallel lines. Multiplexing addresses and data on the same lines means bus sharing: the same bus lines are used (in different cycles) for sending addresses and data. Additional control lines, e.g. wait states, compensate for the speed mismatch between processor and memory.
Binary arithmetic and data representation Unsigned integer numbers (natural numbers) are represented directly in binary. Arithmetic operations on them work as in base 10. Carry and borrow. Multiple-precision arithmetic.
Representing signed integer numbers Variants: sign-magnitude; one's complement (complement to the decremented base); two's complement (complement to the base), where the highest bit has negative weight; biased (shifted) representation.
Arithmetic operations For signed numbers, overflow (rather than carry) signals an out-of-range result. BCD representation with correction codes.
Real number representation Floating-point numbers of the form sign · 2^c · f, where sign is 1 or -1, c is an integer, and f is a fraction. Normalization is the additional condition 1 > f >= 1/2, ensuring unique values of c and f. For zero, f = 0. Maximum precision, easy comparisons. Arithmetic operations include (temporary) denormalization.
Processor optimization Pipelined processing In pipelined processors, different phases of consecutive instructions are processed in parallel. The best speedup is obtained for sequences of instructions; complications arise on changes of control flow or interrupts. In such situations the processor pipeline has to be emptied and refilled starting from a different address. Funny trick on some RISC processors: delayed branch (a.k.a. delay slot), where the jump is performed only after executing the next instruction. Of course the compiler should generate appropriate code by shuffling instructions (empirically: possible in 90% of cases). Speculative execution for conditional branches.
Superscalar architecture Modern processors contain more than one pipeline with independent execution units. As a result, instructions can be executed concurrently. Such an architecture is called superscalar. This architecture has interesting consequences for optimization. Often (e.g. on the Pentium) it is better to replace complex instructions with sequences of simple instructions, because they can be executed in parallel.
Efficiency Perverse example: we have a program with an execution time of 200 seconds, of which 160 seconds are spent in multiplication. How much faster should the multiplication unit work to speed up the whole program 5 times? Let us call this increase in speed w: 200 sec. / 5 = 160 sec. / w + (200 - 160) sec., that is 40 sec. = 160 sec. / w + 40 sec. This would require 160/w = 0, which no finite w satisfies: no matter how fast the multiplier, a 5-fold overall speedup is unattainable, because the remaining 40 seconds alone already equal the target time (Amdahl's law).
Literature B.S. Chalk, Computer Organisation and Architecture: An Introduction; A.S. Tanenbaum, Structured Computer Organization; D.A. Patterson, J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface. Advanced: M.L. Schmit, Pentium Processor Optimization Tools.