Dispatch techniques and closure representations
Jan Midtgaard
Week 3, Virtual Machines for Programming Languages
Aarhus University, Q4-2011
Dispatch techniques
Efficient bytecode interpreters (1/2)
The bytecode interpreter spends all its time doing fetch-decode-execute. Any microsecond saved in each iteration will result in significant speedups.
Efficient bytecode interpreters (2/2)
There are several ways to write a bytecode interpreter:
- Switch-based interpreter
- Threaded interpreters:
  - Direct call threading
  - Direct threading
  - Indirect threading
As we will see, some trade portability for performance.
Switch-based interpreter (1/2)
The classic implementation of a bytecode interpreter uses switch to dispatch:

  typedef enum { add, /* ... */ } Inst;

  Inst program[] = { add, /* ... */ };
  Inst *ip = program;

  void run() {
    while (1) {
      switch (*ip++) {
      case add: {
        int tmp1 = pop();
        int tmp2 = pop();
        push(tmp1 + tmp2);
        break;
      }
      /* ... */
      }
    }
  }
Switch-based interpreter (2/2)
The classic implementation of a bytecode interpreter uses switch to dispatch.
Pro:
- simple
- portable
Con:
- slow
Call-threading interpreter (1/2)
Alternatively, one can use function pointers as instructions. This dispatch technique is called (direct) call threading.

  typedef void (*Inst)();

  void add() {
    int tmp1 = pop();
    int tmp2 = pop();
    push(tmp1 + tmp2);
  }
  /* ... */

  Inst program[] = { add, /* ... */ };
  Inst *ip = program;

  void run() {
    while (1) {
      (*ip++)();
    }
  }
Call-threading interpreter (2/2)
Alternatively, one can use function pointers as instructions. This dispatch technique is called (direct) call threading.
Pro:
- portable
- dispatch takes fewer instructions
Con:
- larger programs (each instruction takes more than a byte: a full function pointer)
Towards a direct-threading interpreter
This is sub-optimal:
- Each instruction always transfers control back to the main loop.
- It does so by calling and returning iteratively.
It would be better if each instruction could just continue the rest of the computation by performing a tail call. (Did someone say continuation-passing style?)
Direct-threading interpreter (1/2)
Unfortunately C does not have proper tail calls, but we can use a GNU extension of C with labels as values:

  typedef void* Inst;

  void run() {
    Inst program[] = { &&add, /* ... */ };
    Inst *ip = program;
    goto **ip++;
  add: {
      int tmp1 = pop();
      int tmp2 = pop();
      push(tmp1 + tmp2);
    }
    goto **ip++;
    /* ... */
  }
Direct-threading interpreter (2/2)
Unfortunately C does not have proper tail calls, but we can use a GNU extension of C with labels as values.
Pro:
- fast
Con:
- less portable (GNU C, not ANSI C)
- larger programs (instructions take more than a byte)
Indirect-threading interpreter (1/2)
We can reduce the space requirements by one level of indirection.

  typedef enum { add, /* ... */ } Inst;

  Inst program[] = { add, /* ... */ };

  void run() {
    void* table[] = { &&add, /* ... */ };
    Inst *ip = program;
    goto *table[*ip++];
  add: {
      int tmp1 = pop();
      int tmp2 = pop();
      push(tmp1 + tmp2);
    }
    goto *table[*ip++];
    /* ... */
  }
Indirect-threading interpreter (2/2)
We can reduce the space requirements by one level of indirection.
Pro:
- (relatively) fast
Con:
- less portable (GNU C, not ANSI C)
Summary
We've seen four different techniques for writing a bytecode interpreter. Speedwise they rank roughly as follows, from slowest to fastest:
- Direct call threading
- Switch-based interpreter
- Indirect threading (GNU C)
- Direct threading (GNU C)
Closure representations
Nested environments and first-class functions
In Java, the lexical environment is coupled to the stack. On the JVM we can therefore refer to locals as stack offsets. In Pascal, procedures can be nested, allowing us to refer to variables in an enclosing scope. In Scheme, ML, and Lua, where function values are first class, variables can even escape, i.e., outlive their stack frame. As a consequence, the environment has to be decoupled from the stack.
Closures in the Scheme VM
Closures are a standard representation for function values. From the DAIMIScheme VM specification:
(load V x T j) loads a Scheme value into T[j]. V can be:
...
- close-flat for a flat closure (then x is the index of the abstraction and aux-vec holds a 1-element list containing the closure environment)
- close-deep for a deep closure (then x is the index of the abstraction and the closure environment is held by env-lex)
A motivating example

  (define f
    (lambda (v w x y z)
      (let ([g (lambda ()
                 (let* ([u (car v)]
                        [h (lambda ()
                             (let ((i (lambda () (+ w x y z 3))))
                               (cons i u)))])
                   h))])
        g)))

In the above example, the nested functions g, h, and i all refer to variables in the enclosing scope. Deep closures represent the nesting explicitly as a linked list of environments.
Deep (or linked) closures (Landin '64)
For the motivating example above, a deep closure links each nested function to its enclosing environment.
[Diagram: boxes for i's code, h's code, and g's code, each paired with an environment; the environments form a chain through u back to the frame holding v, w, x, y, z]
Deep (or linked) closures (Landin '64)
Pro:
- simple
- fast to create
Con:
- may create memory leaks (by keeping too many values live)
- outer variables in nested programs require a search along the chain of environments
Flat (or display) closures (Cardelli '84)

  (define f
    (lambda (v w x y z)
      (let ([g (lambda ()
                 (let* ([u (car v)]
                        [h (lambda ()
                             (let ((i (lambda () (+ w x y z 3))))
                               (cons i u)))])
                   h))])
        g)))

A flat closure gets a copy of its free variables:
[Diagram: g's code with copies of v, w, x, y, z; h's code with copies of u, w, x, y, z; i's code with copies of w, x, y, z]
Flat (or display) closures (Cardelli '84)
Flat closures come with a catch: since each referring closure gets a copy of the value, an assignment to one copy will not be visible to the other copies. This is typically handled by boxing: assigned variables are referred to through one level of indirection.
Flat (or display) closures (Cardelli '84)
Flat closures can be further optimized. For example, two mutually recursive functions can share the same flat environment.
Flat (or display) closures (Cardelli '84)
Pro:
- no memory leaks
- fast access
Con:
- more costly to create
- assigned variables require boxing
Summary
We've seen two classical closure representations, and there are advantages and disadvantages to both. The DAIMIScheme virtual machine supports both, and the DAIMIScheme compiler uses a heuristic to choose between the two. As we will see later today, Lua uses a slightly more advanced variation.