
Values (a.k.a. data) representation
Advanced Compiler Construction, Michel Schinz, 2016-03-10

The problem

Values representation

A compiler must have a way of representing the values of the source language using values of the target language. This representation must be as efficient as possible, in terms of both memory consumption and execution speed. For simple languages like C, the representation is trivial, as most source values (e.g. C's int, float, double, etc.) map directly to corresponding values of the target machine. For more complex languages, things are not so easy.

The problem

Most high-level languages can manipulate values whose exact type is unknown at compilation time. This is trivially true of all dynamically-typed languages, but also of statically-typed languages offering parametric polymorphism (e.g. Scala, Java 5 and later, Haskell). Generating code to manipulate such values is problematic: how should values whose size is unknown be stored in variables, passed as arguments, and so on?

Example

To illustrate the problem of values representation, consider the following L₃ function:

(def pair-make
  (fun (f s)
    (let ((p (@block-alloc-0 2)))
      (@block-set! p 0 f)
      (@block-set! p 1 s)
      p)))

Obviously, nothing is known about the types of f and s at compilation time. How can the compiler generate code to pass f and s around, copy them into the allocated block, etc., given that their size is unknown?

Example

Notice that the very same problem appears in statically-typed languages offering parametric polymorphism, e.g. Scala:

def pairMake[T, U](f: T, s: U): Pair[T, U] =
  new Pair[T, U](f, s)

The solutions

Boxed representation

The simplest solution to the values representation problem is to use a (fully) boxed representation. The idea is that all source values are represented by a pointer to a tagged block (a.k.a. a box) allocated in the heap. The block contains the representation of the source value, and the tag identifies the type of that value. While simple, a fully boxed representation is very costly, since even small values like integers or booleans are stored in the heap. A simple operation like addition requires fetching the integers to add from their boxes, adding them, and storing the result in a newly-allocated box.
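The idea can be sketched in C as follows. This is a minimal illustration, not a real implementation: all names are invented, only two tags are shown, and allocation goes through malloc rather than a garbage-collected heap.

```c
#include <assert.h>
#include <stdlib.h>

/* A fully boxed representation (sketch): every source value is a pointer
   to a heap-allocated block whose tag identifies the value's type. */
enum tag { TAG_INT, TAG_BOOL };

typedef struct {
    enum tag tag;                   /* identifies the type of the boxed value */
    union { long i; int b; } payload;
} box;

static box *box_int(long n) {
    box *b = malloc(sizeof *b);
    b->tag = TAG_INT;
    b->payload.i = n;
    return b;
}

/* Boxed addition, exactly as described above: unbox both operands,
   add the raw integers, then allocate and fill a fresh box. */
static box *box_add(box *x, box *y) {
    assert(x->tag == TAG_INT && y->tag == TAG_INT);
    return box_int(x->payload.i + y->payload.i);
}
```

Note how even a single addition costs two memory reads and one heap allocation, which is precisely why fully boxed representations are expensive.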

Tagging

To avoid paying the hefty price of a fully boxed representation for small values like integers, booleans, etc., tagging can be used for them. Tagging takes advantage of the fact that, on most target architectures, heap blocks are aligned on multiples of 2, 4 or more addressable units. Therefore, some of the least-significant bits (LSBs) of pointers to heap blocks are always zero. As a consequence, it is possible to use words with non-zero LSBs to represent small values directly, without any heap allocation.

Integer tagging example

A popular technique to represent a source integer n as a tagged target value is to encode it as 2n + 1. This ensures that the LSB of the encoded integer is always 1, which makes it distinguishable from pointers at run time. The drawback of this encoding is that source integers have one bit less than the integers of the target architecture. While this is rarely a problem, some applications become very cumbersome to write in such conditions, e.g. the computation of a 32-bit checksum. For that reason, some languages offer several kinds of integer-like types: fast, tagged integers with reduced range, and slower, boxed integers with the full range of the target architecture.
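The 2n + 1 encoding can be sketched in C in a few lines (the names are illustrative; the sketch assumes a word-sized intptr_t and an arithmetic right shift on signed values, which is implementation-defined in C but holds on common platforms):

```c
#include <assert.h>
#include <stdint.h>

/* Encode a source integer n as the tagged value 2n + 1.  The LSB of the
   result is always 1, so it can never be confused with a pointer to a
   heap block aligned on a multiple of 2. */
static intptr_t tag_int(intptr_t n)       { return 2 * n + 1; }
static intptr_t untag_int(intptr_t v)     { return v >> 1; }  /* drop the tag bit */
static int      is_tagged_int(intptr_t v) { return (v & 1) == 1; }
```

The lost bit is visible in the types: a tagged integer stored in a 64-bit word can only represent 63-bit source integers.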

NaN tagging

With the current move towards 64-bit architectures, it can make sense to use 64-bit words to represent values. Although standard tagging techniques can be used in that case, it is also possible to take advantage of a characteristic of 64-bit IEEE 754 floating-point values (i.e. double in Java): so-called not-a-number values (NaNs), which are used to indicate errors, are identified only by their 12 most-significant bits being all 1. The 52 least-significant bits are almost arbitrary; they must merely not all be 0, which would denote an infinity instead. Therefore, floating-point values can be represented naturally, and other values (pointers, integers, booleans, etc.) as specific NaN values.
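A minimal NaN-tagging sketch in C, under stated assumptions: helper names are invented, payloads are 32-bit integers stuffed into the low bits of a quiet-NaN pattern, and genuine NaNs produced by computation (which a real implementation must canonicalize) are ignored:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bits 62..51 set: exponent all 1s plus the quiet bit.  Any word with all
   of these bits set is a quiet NaN, so its remaining bits are free to
   carry a payload.  (Caveat: a genuine run-time NaN would also match and
   must be canonicalized before being stored, which this sketch omits.) */
#define QNAN_MASK UINT64_C(0x7FF8000000000000)

typedef uint64_t value_t;

static value_t from_double(double d) {
    value_t v;
    memcpy(&v, &d, sizeof v);       /* bit copy, no numeric conversion */
    return v;
}

static double to_double(value_t v) {
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

/* Ordinary doubles never have all QNAN_MASK bits set. */
static int is_double(value_t v) { return (v & QNAN_MASK) != QNAN_MASK; }

/* Tag a 32-bit integer payload as a specific quiet NaN. */
static value_t  from_int(uint32_t n) { return QNAN_MASK | (value_t)n; }
static uint32_t to_int(value_t v)    { return (uint32_t)(v & 0xFFFFFFFF); }
```

With this layout, doubles need no encoding or decoding at all, which is the main attraction of NaN tagging for numerically-heavy dynamic languages.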

On-demand boxing

Statically-typed languages have the option of using unboxed/untagged values for monomorphic code, and switching to (and from) boxed/tagged representation only for polymorphic code. Dynamically-typed language implementations can try to infer types to obtain the same result.

(Full) specialization

For statically-typed, polymorphic languages, specialization (or monomorphization) is another way to represent values. Specialization consists in translating polymorphism away by producing several specialized versions of polymorphic code. For example, when the type List[Int] appears in a program, the compiler produces a special class that represents lists of integers and of nothing else.
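The effect of specialization can be imitated by hand in C with a macro that stamps out one copy of the code per type. This is only an illustration of what a specializing compiler generates automatically from a single polymorphic definition; all names here are invented:

```c
#include <assert.h>

/* One "generic" definition, expanded once per element type.  Each
   expansion is a fully monomorphic struct and constructor, with values
   stored directly (unboxed, untagged). */
#define DEFINE_PAIR(T)                          \
    typedef struct { T fst; T snd; } pair_##T;  \
    static pair_##T pair_##T##_make(T f, T s) { \
        pair_##T p = { f, s };                  \
        return p;                               \
    }

DEFINE_PAIR(int)     /* specialized version for int    */
DEFINE_PAIR(double)  /* specialized version for double */
```

Each expansion knows the exact size of its fields, so no uniform representation is needed; the price, as the next slide notes, is one copy of the code per instantiated type.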

(Full) specialization issues

Full specialization removes the need for a uniform representation of values in polymorphic code. However, this is achieved at a considerable cost in terms of code size. Moreover, the specialization process can loop forever in pathological cases like:

class C[T]
class D[T] extends C[D[D[T]]]

Partial specialization

To reap some of the benefits of specialization without paying its full cost, it is possible to perform partial specialization. For example, since in Java all objects are manipulated by reference, there is no point in specializing for more than one kind of reference (e.g. String and Object). It is also possible to generate specialized code only for types that are deemed important (e.g. double in a numerical application) and fall back to non-specialized code, with a uniform representation, for the other types. See for example Scala's @specialized annotation.

Comparing solutions

The figures below show how an object containing the integer 25, the real 3.14 and the string "hello" could be represented using the three techniques previously described.

[Figure: the same three-field object under each representation. Fully boxed: all three fields are pointers to heap boxes holding 25, 3.14 and "hello". Boxed with integer tagging: the integer field holds the tagged value 51 (2·25 + 1) directly, while 3.14 and "hello" are still boxed. (Fully) specialized: the values are stored directly in the object, without boxes.]

Translation of operations

When a source value is encoded as a target value, the operations on source values have to be compiled according to the encoding. For example, if integers are boxed, the addition of two source integers has to be compiled as the following sequence: the two boxed integers are unboxed, the sum of the unboxed values is computed, and finally a new box is allocated and filled with that result; this new box is the result of the addition. Similarly, if integers are tagged, the tags must be removed before the addition is performed, and added back afterwards, at least in principle. In the case of tagging, it is however possible to do better for several operations.

Tagged integer arithmetic

The derivations below show how encoded versions of three basic arithmetic primitives can be obtained for tagged integers, writing ñ = 2n + 1 for the tagged encoding of n. Similar derivations can be made for other operations (division, remainder, bitwise operations, etc.).

addition:       2[(ñ − 1)/2 + (m̃ − 1)/2] + 1 = (ñ − 1) + (m̃ − 1) + 1 = ñ + m̃ − 1
subtraction:    2[(ñ − 1)/2 − (m̃ − 1)/2] + 1 = (ñ − 1) − (m̃ − 1) + 1 = ñ − m̃ + 1
multiplication: 2[((ñ − 1)/2) × ((m̃ − 1)/2)] + 1 = (ñ − 1) × ((m̃ − 1)/2) + 1
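These derivations can be checked with a small C sketch (illustrative names; an arithmetic right shift on signed values is assumed). Note how addition and subtraction need no untagging at all, and multiplication only half-untags one operand:

```c
#include <assert.h>
#include <stdint.h>

/* Tagged integers: a source integer n is encoded as 2n + 1. */
static intptr_t tag(intptr_t n)   { return 2 * n + 1; }
static intptr_t untag(intptr_t v) { return v >> 1; }

/* Encoded operations, following the derivations above. */
static intptr_t tagged_add(intptr_t a, intptr_t b) { return a + b - 1; }
static intptr_t tagged_sub(intptr_t a, intptr_t b) { return a - b + 1; }
static intptr_t tagged_mul(intptr_t a, intptr_t b) {
    /* (a - 1) is 2n, (b - 1)/2 is m, so the product is 2nm, plus the tag. */
    return (a - 1) * ((b - 1) / 2) + 1;
}
```

The division (b − 1)/2 is exact, since b − 1 is always even for a tagged integer, so the result is correct for negative operands as well.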

L₃ values representation

L₃ values

L₃ has the following kinds of values: tagged blocks, functions, integers, characters, booleans and unit. Tagged blocks are represented as pointers to the corresponding heap blocks. Functions are currently represented as code pointers, although we will see later that this is incorrect! Integers, characters, booleans and the unit value are represented as tagged values.

L₃ tagging scheme

For L₃, we require that the two least-significant bits (LSBs) of pointers are always 0. This enables us to represent integers, characters, booleans and unit as tagged values. The tagging scheme for L₃ is given by the table below.

Kind of value    | LSBs
-----------------|-------
Integer          | 1₂
Block (pointer)  | 00₂
Character        | 110₂
Boolean          | 1010₂
Unit             | 0010₂
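This table translates directly into bit tests. A C sketch (hypothetical helper names; value is just a machine word holding a tagged L₃ value, and the boolean/unit encodings 11010₂, 01010₂ and 0010₂ come from the slides further below):

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value;

/* Each test masks out the tag bits named in the table and compares. */
static int is_int(value v)  { return (v & 0x1) == 0x1; }  /* LSB  1    */
static int is_ptr(value v)  { return (v & 0x3) == 0x0; }  /* LSBs 00   */
static int is_char(value v) { return (v & 0x7) == 0x6; }  /* LSBs 110  */
static int is_bool(value v) { return (v & 0xF) == 0xA; }  /* LSBs 1010 */
static int is_unit(value v) { return v == 0x2; }          /* 0010      */
```

The tags are chosen so that the tests are mutually exclusive: a boolean's three LSBs are 010, so it can never pass the integer or character test, and so on.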

Values representation phase

The values representation phase of the L₃ compiler takes as input a CPS program in which all values and primitives are high-level, in that they have the semantics that the L₃ language gives them. It produces a low-level version of that program as output, in which all values are either integers or pointers, and primitives correspond directly to instructions of a typical microprocessor. As usual, we will specify this phase as a transformation function, written ⟦·⟧, which maps high-level CPS terms to their low-level equivalents.

Representing L₃ functions

L₃ functions also need to be represented specially, so that the operations permitted on them (e.g. function?) can be implemented in the target language. The phase that takes care of representing functions is commonly known as closure conversion. While it logically belongs to the values representation phase, it will be presented separately in the next lecture. For now, we assume, incorrectly, that functions are not translated specially:

⟦(let_f ((f₁ (fun (c₁ n₁,₁ …) e₁)) …) e)⟧ =
  (let_f ((f₁ (fun (c₁ n₁,₁ …) ⟦e₁⟧)) …) ⟦e⟧)
⟦(app_f n n_c n₁ …)⟧ = (app_f n n_c n₁ …)

Representing L₃ continuations

Continuations, unlike functions, are limited enough that they do not need to be transformed! Their body must still be transformed recursively:

⟦(let_c ((c₁ (cnt (n₁,₁ …) e₁)) …) e)⟧ =
  (let_c ((c₁ (cnt (n₁,₁ …) ⟦e₁⟧)) …) ⟦e⟧)
⟦(app_c n n₁ …)⟧ = (app_c n n₁ …)

Representing L₃ integers (1)

⟦(let_l ((n i)) e)⟧  where i is an integer literal =
  (let_l ((n 2i+1)) ⟦e⟧)

⟦(if (int? n) ct cf)⟧ =
  (let* ((c1 1)
         (t1 (& n c1)))
    (if (= t1 c1) ct cf))

(& is bitwise and.)

Representing L₃ integers (2)

⟦(let_p ((n (+ n₁ n₂))) e)⟧ =
  (let* ((c1 1)
         (t1 (+ n₁ n₂))
         (n (- t1 c1)))
    ⟦e⟧)

Other arithmetic primitives are similar.

⟦(if (< n₁ n₂) ct cf)⟧ = (if (< n₁ n₂) ct cf)

Other integer comparison primitives are similar: since the 2n+1 encoding is monotonic, tagged integers compare exactly like the integers they encode.

Representing L₃ integers (3)

⟦(let_p ((n (block-alloc-k n₁))) e)⟧ =
  (let* ((c1 1)
         (t1 (>> n₁ c1))
         (n (block-alloc-k t1)))
    ⟦e⟧)

⟦(let_p ((n (block-tag n₁))) e)⟧ =
  (let* ((c1 1)
         (t1 (block-tag n₁))
         (t2 (<< t1 c1))
         (n (+ t2 c1)))
    ⟦e⟧)

(>> is right shift, << is left shift.) Other block primitives are similar.

Representing L₃ integers (4)

⟦(let_p ((n (byte-read))) e)⟧ =
  (let* ((t1 (byte-read))
         (c1 1)
         (t2 (<< t1 c1))
         (n (+ t2 c1)))
    ⟦e⟧)

⟦(let_p ((n (byte-write n₁))) e)⟧ = left as an exercise

Representing L₃ characters

⟦(let_l ((n c)) e)⟧  where c is a character literal =
  (let_l ((n ((code-point(c) << 3) | 110₂))) ⟦e⟧)

⟦(let_p ((n (char->int n₁))) e)⟧ =
  (let* ((c2 2)
         (n (>> n₁ c2)))
    ⟦e⟧)

⟦(let_p ((n (int->char n₁))) e)⟧ =
  (let* ((c2 2)
         (t1 (<< n₁ c2))
         (n (+ t1 c2)))
    ⟦e⟧)
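The two conversions reduce to simple shifts because of how the tags were chosen: a character with code point cp is encoded as 8·cp + 6, and an integer n as 2n + 1, so (8·cp + 6) >> 2 is exactly 2·cp + 1. A C sketch with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* char->int: (8*cp + 6) >> 2 == 2*cp + 1, the tagged integer for cp.
   int->char: ((2*n + 1) << 2) + 2 == 8*n + 6, the tagged character for n. */
static uintptr_t l3_char_to_int(uintptr_t c) { return c >> 2; }
static uintptr_t l3_int_to_char(uintptr_t n) { return (n << 2) + 2; }
```

No untagging or retagging is needed at all; each conversion is a single instruction on typical hardware.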

Representing L₃ booleans

⟦(let_l ((n #t)) e)⟧ = (let_l ((n 11010₂)) ⟦e⟧)
⟦(let_l ((n #f)) e)⟧ = (let_l ((n 01010₂)) ⟦e⟧)

⟦(if (bool? n) ct cf)⟧ =
  (let* ((m 1111₂)
         (t 1010₂)
         (r (& n m)))
    (if (= r t) ct cf))

Representing L₃ unit, etc.

⟦(let_l ((n #u)) e)⟧ = (let_l ((n 0010₂)) ⟦e⟧)
⟦(if (unit? n) ct cf)⟧ = left as an exercise
⟦(halt n)⟧ = left as an exercise

Names are left untouched by the values representation transformation: ⟦n⟧ = n

Exercise

How does the values representation phase translate the following CPS/L₃ version of the successor function?

(let_f ((succ (fun (c x)
                (let* ((c1 1)
                       (t1 (+ x c1)))
                  (app_c c t1)))))
  succ)