Model-based Software Development

[Overview figure: a SCADE Suite application model (data flow + SSM) and a system model (tasks, interrupts, buses, ...) are fed into a code generator that produces C code, which a compiler translates into binary code. Analysis tools attach at every level: SymTA/S for system-level schedulability analysis, Astrée for runtime error analysis on the C code, and aiT and StackAnalyzer for worst-case execution time and stack usage analysis on the binary code.]

The Task of a Compiler

The task of a compiler for a high-level programming language (C, C++, Java, ...) is to transform an input program written in that language into a semantically equivalent sequence of machine instructions for the target processor. The generated code should have some desired properties: it should be as short as possible, as fast as possible, or use as little energy as possible.

Why Care?

C code (FIR filter):

int i, j, sum;
for (i = 0; i < n - m; i++) {
    sum = 0;
    for (j = 0; j < m; j++) {
        sum += array1[i+j] * coeff[j];
    }
    output[i] = sum >> 15;
}

Compiler-generated code (gcc):

.l21:
    add     %d15,%d3,%d1
    addsc.a %a15,%a4,%d15,1
    addsc.a %a2,%a5,%d1,1
    mov     %d4,49
    ld.h    %d0,[%a15]0
    ld.h    %d15,[%a2]0
    madd    %d2,%d2,%d0,%d15
    add     %d1,%d1,1
    jge     %d4,%d1,.l21

Hand-written code (inner loop):

_8:
    ld16.w  d5, [a2+]
    ld16.w  d4, [a3+]
    madd.h  e8,e8,d5,d4ul,#0
    loop    a7,_8

The hand-written loop is 6 times faster. Processor performance depends on the quality of code generation; good code generation is a challenge.

Compiler Structure

Input program
-> Frontend: syntactic and semantic analysis
-> Middle end: IR optimization on the intermediate representation
-> Backend: code generation and code optimization
-> Assembly or machine code

Example:

void main(void) { int a, b, c; c = a + b; }

is compiled to

r1 = dm(i7, m7)
r2 = pm(i8, m8)
r3 = r1 + r2

Detailed Compiler Structure

Source (text)
-> Frontend: lexical analysis, syntactical analysis, semantical analysis; produces a decorated syntax tree
-> Middle end: high-level optimizations on the high-level intermediate representation
-> Backend: low-level optimization and code generation on the low-level intermediate representation
-> Machine program

Backend: Main Phases

Code selection: the IR statement c = a + b; is mapped to machine operations, e.g.

load adr(a)
load adr(b)
add
store adr(c)

Register allocation: values are placed in registers, e.g.

r1 = load adr(a)
r2 = load adr(b)
r1 = add r1, r2
store adr(c), r1

Instruction scheduling: the operation stream is reordered, e.g. so that both loads are issued before their results are needed by the add.

Main Tasks of Code Generation (1)

Code selection: map the intermediate representation to a semantically equivalent sequence of machine operations that is as efficient as possible.

Register allocation: map the values of the intermediate representation to physical registers in order to minimize the number of memory references during program execution. It splits into two subtasks:

- Register allocation proper: decide which variables and expressions of the IR are mapped to registers and which ones are kept in memory.
- Register assignment: determine the physical registers that are used to store the values that have previously been selected to reside in registers. (A small sketch follows below.)
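
To make the distinction concrete, here is a hedged sketch in C: the function names and the register names v1..v3, r0, r1 are purely illustrative, and the "registers" are ordinary C variables standing in for machine registers.

/* Illustrative only: the same computation over unlimited virtual
 * registers (what the code selector assumes) and after register
 * allocation/assignment onto two physical registers r0 and r1. */
int with_virtual_regs(int a, int b, int c) {
    int v1 = a + b;      /* v1, v2, v3: virtual registers */
    int v2 = v1 * c;
    int v3 = v2 - a;
    return v3;
}

int with_physical_regs(int a, int b, int c) {
    int r0, r1;          /* the two physical registers */
    r0 = a + b;          /* allocation proper: v1 lives in a register */
    r0 = r0 * c;         /* assignment: v2 reuses r0, since v1 is dead */
    r1 = r0 - a;         /* v3 -> r1 */
    return r1;
}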

Main Tasks of Code Generation (2)

Instruction scheduling: reorder the produced operation stream in order to minimize pipeline stalls and exploit the available instruction-level parallelism.

Resource allocation / functional unit binding: bind operations to machine resources, e.g. functional units or buses.

The Code Generation Problem

Optimal code selection: NP-complete. Optimal register allocation: NP-complete. Optimal instruction scheduling: NP-complete.

All of them are interdependent, i.e. decisions made in one phase may impose restrictions on the others. Usually they are addressed by heuristic methods in separate phases. The result is an often suboptimal combination of suboptimal partial results.

The Phase Ordering Problem

Source:

d1 = s1*s1;
d2 = s1+s2;

Optimized for the number of used registers:

R3 = R1*R1
store R3
R3 = R1+R2
store R3

Optimized for code speed and size (both operations issued in parallel, at the cost of one more register):

R3 = R1*R1, R4 = R1+R2
store R3
store R4

Impact of Code Selection

struct {
    unsigned b1:1;
    unsigned b2:1;
    unsigned b3:1;
} s;

s.b3 = s.b1 && s.b2;

Compiler-generated code (9 cycles, 42 bytes):

ld.bu  d5, [sp]48
extr.u d5, d5, #0, #1
ne     d5, d5, #0
mov16  d4, d5
jeq    d4, #0, _9
ld.bu  d4, [sp]48
extr.u d4, d4, #1, #1
ne     d4, d4, #0
_9:
ld.b   d3, [sp]48
insert d3, d3, d4, #2, #1
st.b   [sp]48, d3

Optimized code (3 cycles, 16 bytes):

ld.bu  d5, [sp]48
and.t  d3, d5, #0, d5, #1
insert d5, d5, d3, #2, #1
st.b   [sp]48, d5

Intermediate Representations

- Call graph
- Control flow graph
- Basic block graph
- Data dependence graph

High-Level and Low-Level IRs

High-level intermediate representation: close to the source level; typically centered around source language constructs, e.g. implicit memory addressing, expression trees, for-, while- and switch-statements.

Low-level intermediate representation: close to the machine level; typically centered around basic entities that specify properties of machine operations.

Most program representations can be defined both at high level and at low level.
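
A minimal sketch of the difference, assuming C as the source language: the high-level form a[i] = 0 uses implicit memory addressing, while the low-level form spells out the address arithmetic a low-level IR would contain. The function names are illustrative.

#include <stdint.h>

void high_level(int a[], int i) {
    a[i] = 0;                      /* implicit addressing: a[i] */
}

void low_level(int a[], int i) {
    /* explicit address computation, as in a low-level IR */
    uintptr_t t1 = (uintptr_t)a + (uintptr_t)i * sizeof(int);
    *(int *)t1 = 0;                /* store 0 -> [t1] */
}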

Call Graph

There is a node for the main procedure, which is the entry node of the program, and a node for each procedure declared in the program. The nodes are labeled with the procedure names. There is an edge from the node of a procedure p to the node of a procedure q if there is a call to q inside p.
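
For illustration, a small (hypothetical) C program and its call graph:

#include <stdio.h>

static void q(void) { puts("q"); }
static void p(void) { q(); }       /* call to q inside p */

int main(void) {                   /* entry node of the call graph */
    p();                           /* edge main -> p */
    q();                           /* edge main -> q */
    return 0;
}

/* Call graph: nodes {main, p, q}, edges main->p, main->q, p->q. */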

Control Flow Graph

Example program:

[y := x]^1;
[z := 1]^2;
while [y > 1]^3 do {
    [z := z*y]^4;
    [y := y-1]^5
};
[y := 0]^6

[Figure: the corresponding control flow graph with one node per labeled instruction; edges 1 -> 2 -> 3, 3 -> 4 (T), 4 -> 5, 5 -> 3, and 3 -> 6 (F).]
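
The same labeled program written as C (it computes the factorial of x in z, for x >= 1, and then clears y); the comments repeat the labels of the CFG nodes:

int cfg_example(int x) {
    int y = x;          /* [y := x]^1   */
    int z = 1;          /* [z := 1]^2   */
    while (y > 1) {     /* [y > 1]^3    */
        z = z * y;      /* [z := z*y]^4 */
        y = y - 1;      /* [y := y-1]^5 */
    }
    y = 0;              /* [y := 0]^6   */
    return z;
}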

Control Flow Graph (cont.)

The control flow graph of a procedure is a directed, node- and edge-labeled graph G_C = (N_C, E_C, n_A, n_Ω):

- For each instruction i of the procedure there is a node n_i labeled i.
- An edge (n, m, λ) with edge label λ ∈ {T, F, ε} denotes control flow of the procedure. Subgraphs for composed statements are shown on the next slide. Edges belonging to unconditional branches lead from the node of the branch to the branch destination.
- The node n_A is the uniquely determined entry point of the procedure; it belongs to the first instruction to be executed. n_Ω is the end node of any path through the control flow graph.
- Nodes with more than one predecessor are called joins; nodes with more than one successor are called forks.

Control Flow Graphs of Composed Statements

[Figure: construction schemas. cfg(S1; S2) chains cfg(S1) into cfg(S2). cfg(if B then S1 else S2 fi) has a fork node B with a T-edge into cfg(S1) and an F-edge into cfg(S2), both joining afterwards. cfg(while B do S od) has a fork node B with a T-edge into cfg(S), which leads back to B, and an F-edge leaving the loop.]

Basic Block Graph

A basic block in a control flow graph is a path of maximal length which has no joins except at the beginning and no forks except possibly at the end.

The basic block graph G_B = (N_B, E_B, b_A, b_Ω) of a control flow graph G_C = (N_C, E_C, n_A, n_Ω) is formed from G_C by combining each basic block into a node. Edges of G_C leading into the first node of a basic block lead to the node of that basic block in G_B; edges of G_C leaving the last node of a basic block lead out of the node of that basic block in G_B. The node b_A denotes the uniquely determined entry block of the procedure; b_Ω denotes the exit block that is reached at the end of any path through the procedure.
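
A minimal sketch of the standard leader-based basic-block construction, applied to a toy instruction encoding of the while-loop example above. The instruction representation and names are assumptions made for illustration; only branches end blocks here (no calls or other complications).

#include <stdio.h>

typedef struct {
    const char *text;
    int is_branch;   /* possible fork at the end of a block */
    int target;      /* index of the branch target, -1 if none */
} Insn;

int main(void) {
    Insn code[] = {
        {"y := x",           0, -1},  /* 0 */
        {"z := 1",           0, -1},  /* 1 */
        {"if y <= 1 goto 6", 1,  6},  /* 2 */
        {"z := z*y",         0, -1},  /* 3 */
        {"y := y-1",         0, -1},  /* 4 */
        {"goto 2",           1,  2},  /* 5 */
        {"y := 0",           0, -1},  /* 6 */
    };
    int n = (int)(sizeof code / sizeof code[0]);
    int leader[16] = {0};

    leader[0] = 1;                            /* first instruction */
    for (int i = 0; i < n; i++) {
        if (!code[i].is_branch) continue;
        if (code[i].target >= 0)
            leader[code[i].target] = 1;       /* branch target */
        if (i + 1 < n)
            leader[i + 1] = 1;                /* instruction after a branch */
    }
    /* A basic block runs from one leader to just before the next:
       here {0,1}, {2}, {3,4,5}, {6}. */
    for (int i = 0, b = 0; i < n; i++) {
        if (leader[i]) printf("--- block %d ---\n", b++);
        printf("%d: %s\n", i, code[i].text);
    }
    return 0;
}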

Types of Microprocessors

Complex Instruction Set Computer (CISC):
- large number of complex addressing modes
- many versions of instructions for different operands
- different execution times for instructions
- few processor registers
- microprogrammed control logic

Reduced Instruction Set Computer (RISC):
- one instruction per clock cycle
- memory accesses by dedicated load/store instructions
- few addressing modes
- hard-wired control logic

Example Instruction (IA-32)

Execution time:
- 1 cycle: ADD EAX, EBX
- 2 cycles: ADD EAX, memvar32
- 3 cycles: ADD memvar32, EAX
- 4 cycles: ADD memvar16, AX

Instruction width: between 1 byte (NOP) and 16 bytes.

Types of Microprocessors (cont.)

Superscalar processors:
- subclass of RISCs or CISCs
- multiple instruction pipelines for overlapping execution of instructions
- parallelism not necessarily exposed to the compiler

Very Long Instruction Word (VLIW) processors:
- statically determined instruction-level parallelism (under compiler control)
- instructions are composed of different machine operations whose execution is started in parallel
- many parallel functional units
- large register sets

VLIW Architectures

[Figure: a VLIW instruction word consisting of multiple instruction slots (instruction slot 0, instruction slot 1, ...), each holding one machine operation (operation 0, operation 1, operation 2) that is issued in parallel with the others.]

Characteristics of Embedded Processors (1)

Multiply-accumulate units: multiplication and accumulation in a single clock cycle (vector products, digital filters, correlation, Fourier transforms, etc.).

Multiple-access memory architectures for high bandwidth between processor and memory:
- Goal: throughput of one operation per clock cycle. Required: several memory accesses per clock cycle.
- Separate data and program memory spaces: Harvard architecture.
- Multiple memory banks.
- Arithmetic operations in parallel to memory accesses, but often with irregular restrictions.

Specialized addressing modes, e.g. bit-reverse addressing or auto-modify addressing.

Characteristics of Embedded Processors (2)

Predicated/guarded execution: instruction execution depends on the value of explicitly specified bit values or registers. (See the sketch below.)

Hardware loops / zero-overhead loops: no explicit loop counter increment/decrement, no loop condition check, no branch back to the top of the loop.

Restricted interconnectivity between registers and functional units -> phase coupling problems.

Strongly encoded instruction formats: a throughput of one instruction per clock cycle requires one instruction to be fetched per cycle. Thus each instruction has to fit into one memory word => reduced bit width of the instruction.
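
A hedged sketch of what predicated execution enables, in plain C: if-conversion replaces the branch with straight-line code in which the predicate p decides which result takes effect; on a predicated ISA, both guarded operations can be issued without any branch. The function names are illustrative.

/* Branchy form: control flow depends on p. */
int with_branch(int p, int a, int b) {
    int x;
    if (p) x = a; else x = b;
    return x;
}

/* If-converted form: straight-line code, the predicate selects the
 * value. A predicated ISA would guard the assignments with p and !p. */
int if_converted(int p, int a, int b) {
    int x = b;        /* "x = b if !p" */
    x = p ? a : x;    /* "x = a if p"  */
    return x;
}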

Characteristics of Embedded Processors (3)

SIMD instructions: focus on multimedia applications. SIMD: Single Instruction, Multiple Data. A SIMD instruction operates concurrently on data that are packed in a single register or memory location.

[Figure: two halfwords packed per source register; the operation op is applied lane-wise: Halfword 1 op Halfword 1' -> Destination 1, and in parallel Halfword 0 op Halfword 0' -> Destination 0.]

Example: the four independent multiply-accumulates

a = b + c*z[i+0]
d = e + f*z[i+1]
r = s + t*z[i+2]
w = x + y*z[i+3]

become one SIMD multiply and one SIMD add over the packed vectors (b,e,s,x), (c,f,t,y) and (z[i+0],...,z[i+3]), producing (a,d,r,w). (A plain-C model follows below.)
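
A plain-C model of the packed-halfword semantics (not a real intrinsic; the function name is made up): two 16-bit lanes are added with a single 32-bit operation, and carries do not cross the lane boundary.

#include <stdint.h>
#include <stdio.h>

/* Emulated SIMD add on two halfword lanes packed in one 32-bit word. */
static uint32_t simd_add16(uint32_t x, uint32_t y) {
    uint16_t lo = (uint16_t)((x & 0xFFFFu) + (y & 0xFFFFu)); /* lane 0 */
    uint16_t hi = (uint16_t)((x >> 16)     + (y >> 16));     /* lane 1 */
    return ((uint32_t)hi << 16) | lo;  /* no carry between the lanes */
}

int main(void) {
    uint32_t x = (3u  << 16) | 5u;     /* lanes (3, 5)   */
    uint32_t y = (10u << 16) | 20u;    /* lanes (10, 20) */
    uint32_t r = simd_add16(x, y);
    printf("hi=%u lo=%u\n",
           (unsigned)(r >> 16), (unsigned)(r & 0xFFFFu)); /* hi=13 lo=25 */
    return 0;
}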

Characteristics of Embedded Processors (4)

Cost constraints, low power requirements and specialization lead to:
- irregularity,
- phase coupling problems during code generation,
- a need for specialized algorithms.

The Phase Coupling Problem

Main subtasks of compiler backends:
- Code selection: mapping of IR statements to machine instructions of the target processor.
- Register allocation: map variables and expressions to registers in order to minimize the number of memory references during program execution.
- Register assignment: determine the physical register used to store a value that has previously been selected to reside in a register.
- Instruction scheduling: reorder an instruction sequence in order to exploit instruction-level parallelism and minimize pipeline stalls.
- Resource allocation: assign functional units and buses to operations.

The Phase Coupling Problem (cont.)

Classical approaches: isolated solutions by heuristic methods (list scheduling, trace scheduling, graph-coloring register allocation, etc.; a minimal list-scheduling sketch follows below).

Problem: the interdependence of the code generation phases. A combination of suboptimal partial results is itself suboptimal and yields inefficient code.

Solution in a perfect world: address all problems simultaneously in a single phase. BUT: how would this be formulated? Code selection, register allocation/register assignment and instruction scheduling are NP-hard problems by themselves. This means that in general there is no chance to solve even a single one of these tasks optimally in isolation.
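
To make one of these heuristics concrete, here is a minimal list-scheduling sketch for the c = a + b example from the backend slides, under simplifying assumptions: a single-issue machine, a hard-coded dependence matrix, a load latency of two cycles, and ready operations picked in index order rather than by a critical-path priority.

#include <stdio.h>

enum { N = 4 };

static const char *op[N] = {
    "r1 = load adr(a)", "r2 = load adr(b)",
    "r1 = add r1, r2",  "store adr(c), r1"
};
/* deps[p][s] = 1: operation s depends on the result of operation p. */
static const int deps[N][N] = {
    {0,0,1,0}, {0,0,1,0}, {0,0,0,1}, {0,0,0,0}
};
static const int latency[N] = {2, 2, 1, 1};  /* loads take 2 cycles */

int main(void) {
    int ready_at[N] = {0}, issued[N] = {0};
    int cycle = 0, scheduled = 0;
    while (scheduled < N) {
        int pick = -1;
        for (int i = 0; i < N && pick < 0; i++) {
            if (issued[i] || ready_at[i] > cycle) continue;
            int ok = 1;                        /* all predecessors issued? */
            for (int p = 0; p < N; p++)
                if (deps[p][i] && !issued[p]) ok = 0;
            if (ok) pick = i;
        }
        if (pick >= 0) {
            printf("cycle %d: %s\n", cycle, op[pick]);
            issued[pick] = 1;
            scheduled++;
            for (int s = 0; s < N; s++)        /* results arrive later */
                if (deps[pick][s] && ready_at[s] < cycle + latency[pick])
                    ready_at[s] = cycle + latency[pick];
        } else {
            printf("cycle %d: stall\n", cycle); /* pipeline stall */
        }
        cycle++;
    }
    return 0;
}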

Code Selection and Register Allocation

The goal of code selection is to determine the cheapest instruction sequence for a subgraph of the IR. However, the code selector does not know what the real overall cost of the instruction sequence will be; it can only use estimates.

Register allocation is usually done after code selection, so the code selector typically has to assume an infinite number of registers (virtual registers). Consequently, when estimating the cost of an instruction sequence, the code selector will mostly assume register references. Register allocation, however, has to cope with a finite number of registers: if there are too few registers, spill code is generated. The cost of this spill code has not been considered during code selection, so the chosen operation sequence may in fact be a bad choice, since another one might have worked without spill code. (A small spill sketch follows below.)
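
A hedged sketch of the effect, with C variables standing in for two physical registers and a stack slot: the expression (a+b)*(c+d) + (e+f)*(g+h) needs three simultaneously live values at one point, so on a two-register machine a spill store and a reload are unavoidable. All names are illustrative.

/* Computing (a+b)*(c+d) + (e+f)*(g+h) on a "machine" with two
 * registers r0, r1 and one spill slot in memory. */
int spill_demo(int a, int b, int c, int d, int e, int f, int g, int h) {
    int r0, r1;       /* the two physical registers */
    int spill0;       /* a stack slot for spill code */

    r0 = a + b;
    r1 = c + d;
    r0 = r0 * r1;     /* left product lives in r0; r1 is free again */

    r1 = e + f;       /* now r0 and r1 are both live...           */
    spill0 = r0;      /* spill store: make room for computing g+h */
    r0 = g + h;
    r1 = r1 * r0;     /* right product */
    r0 = spill0;      /* reload the left product */

    return r0 + r1;
}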