Fast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations

Similar documents
Fast and Accurate Resource Conflict Simulation for Performance Analysis of Multi-Core Systems

Modeling, Analysis and Refinement of Heterogeneous Interconnected Systems Using Virtual Platforms

The Next Generation of Virtual Prototyping: Ultra-fast Yet Accurate Simulation of HW/SW Systems

Constraint-based Platform Variants Specification for Early System Verification

Embedded System Design Modeling, Synthesis, Verification

Embedded System Design

Hardware/Software Co-design

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration

Automatic Instrumentation of Embedded Software for High Level Hardware/Software Co-Simulation

Trace-Based Context-Sensitive Timing Simulation Considering Execution Path Variations

Hardware Design and Simulation for Verification

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015

On the Use of Context Information for Precise Measurement-Based Execution Time Estimation

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Design Space Exploration of Systems-on-Chip: DIPLODOCUS

Cycle-approximate Retargetable Performance Estimation at the Transaction Level

Cycle Accurate Binary Translation for Simulation Acceleration in Rapid Prototyping of SoCs

FZI Forschungszentrum Informatik

Workshop 1: Specification for SystemC-AADL interoperability

Tim Kogel. June 13, 2010

SpecC Methodology for High-Level Modeling

Embedded System Design and Modeling EE382V, Fall 2008

Taking the Right Turn with Safe and Modular Solutions for the Automotive Industry

DEVELOPMENT OF DISTRIBUTED AUTOMOTIVE SOFTWARE The DaVinci Methodology

Cosimulation of ITRON-Based Embedded Software with SystemC

SoC Design for the New Millennium Daniel D. Gajski

Overview. Technology Details. D/AVE NX Preliminary Product Brief

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Efficient use of Virtual Prototypes in HW/SW Development and Verification

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann

Exploring SW Performance using SoC Transaction-level Modeling

An Encapsulated Communication System for Integrated Architectures

Memory Access Reconstruction Based on Memory Allocation Mechanism for Source-Level Simulation of Embedded Software

A Meta-Modeling-Based Approach for Automatic Generation of Fault- Injection Processes

Verification Futures The next three years. February 2015 Nick Heaton, Distinguished Engineer

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS

System-level simulation (HW/SW co-simulation) Outline. EE290A: Design of Embedded System ASV/LL 9/10

Transaction-Level Modeling Definitions and Approximations. 2. Definitions of Transaction-Level Modeling

ESL design with the Agility Compiler for SystemC

Modeling and SW Synthesis for

Introduction to MLM. SoC FPGA. Embedded HW/SW Systems

Abstraction Layers for Hardware Design

A High Integrity Distributed Deterministic Java Environment. WORDS 2002 January 7, San Diego CA

Static WCET Analysis: Methods and Tools

An Equivalence Checker for Hardware-Dependent Software

Martin Kruliš, v

System Level Design with IBM PowerPC Models

The Challenges of System Design. Raising Performance and Reducing Power Consumption

1 Publishable Summary

Automotive Networks Are New Busses and Gateways the Answer or Just Another Challenge? ESWEEK Panel Oct. 3, 2007

Improving Timing Analysis for Matlab Simulink/Stateflow

Choosing IP-XACT IEEE 1685 standard as a unified description for timing and power performance estimations in virtual platforms platforms

Design Refinement of Embedded Analog/Mixed-Signal Systems and how to support it*

System On Chip: Design & Modelling (SOC/DAM) 1 R: Verilog RTL Design with examples.

ARM System-Level Modeling. Platform constructed from welltested

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

QEMU and SystemC. Màrius Màrius Montón

Guido Sandmann MathWorks GmbH. Michael Seibt Mentor Graphics GmbH ABSTRACT INTRODUCTION - WORKFLOW OVERVIEW

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Distributed Systems Programming (F21DS1) Formal Verification

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Codesign Framework. Parts of this lecture are borrowed from lectures of Johan Lilius of TUCS and ASV/LL of UC Berkeley available in their web.

OSCI Update. Guido Arnout OSCI Chief Strategy Officer CoWare Chairman & Founder

Model-Based Design of Automotive RT Applications

System Level Design Technologies and System Level Design Languages

Long Term Trends for Embedded System Design

2. HW/SW Co-design. Young W. Lim Thr. Young W. Lim 2. HW/SW Co-design Thr 1 / 21

Virtual Validation of Cyber Physical Systems

Parallel Simulation Accelerates Embedded Software Development, Debug and Test

Complexity-Reducing Design Patterns for Cyber-Physical Systems. DARPA META Project. AADL Standards Meeting January 2011 Steven P.

Outline. SLD challenges Platform Based Design (PBD) Leveraging state of the art CAD Metropolis. Case study: Wireless Sensor Network

SPACE: SystemC Partitioning of Architectures for Co-design of real-time Embedded systems

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

AADS+: AADL Simulation including the Behavioral Annex

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC

ECE 587 Hardware/Software Co-Design Lecture 12 Verification II, System Modeling

What are Embedded Systems? Lecture 1 Introduction to Embedded Systems & Software

Real-Time Component Software. slide credits: H. Kopetz, P. Puschner

EFFICIENT AND EXTENSIBLE TRANSACTION LEVEL MODELING BASED ON AN OBJECT ORIENTED MODEL OF BUS TRANSACTIONS

Distributed Operation Layer Integrated SW Design Flow for Mapping Streaming Applications to MPSoC

The Architects View Framework: A Modeling Environment for Architectural Exploration and HW/SW Partitioning

Network simulation with. Davide Quaglia

Predicting Energy and Performance Overhead of Real-Time Operating Systems

Approach for Iterative Validation of Automotive Embedded Systems

Distributed Operation Layer

Semantics-Based Integration of Embedded Systems Models

RTK-Spec TRON: A Simulation Model of an ITRON Based RTOS Kernel in SystemC

Module 20: Multi-core Computing Multi-processor Scheduling Lecture 39: Multi-processor Scheduling. The Lecture Contains: User Control.

Flexible and Executable Hardware/Software Interface Modeling For Multiprocessor SoC Design Using SystemC

From MDD back to basic: Building DRE systems

Non-invasive Software Verification using Vista Virtual Platforms by Alex Rozenman, Vladimir Pilko, and Nilay Mitash, Mentor Graphics

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

ECE 3055: Final Exam

4 th European SystemC Users Group Meeting

AUTOSAR Software Design with PREEvision

Performance Verification for ESL Design Methodology from AADL Models

Modelling, simulation, and advanced tracing for extra-functional properties in SystemC/TLM


Test requirements in networked systems

Transcription:

FZI Forschungszentrum Informatik at the University of Karlsruhe Fast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations Oliver Bringmann 1 RESEARCH ON YOUR BEHALF

Outline Embedded Software Challenges TLM2 Platform Modeling Source-Level Timing Instrumentation Consideration of Compiler Optimizations Experimental Results 2

Trend Towards Multi-Core Embedded Systems Example: Automotive Domain Transition from passive to active safety Active systems: Innovation by interaction of ECUs, added-value by synergetic networking Multi-sensor data fusion and image recognition for automated situation interpretation in proactive cars Well-tailored Embedded Platforms Increasing computation and energy requirements Distributed embedded platforms with energyefficient multi-core embedded processors Challenges 3 Early verification of global safety and timing requirements Consideration of the actual software implementation w.r.t. the underlying hardware Scalable verification methodology for multi-core & distributed embedded systems

Platform Composition CPU IP RAM AXI APB Comm. Proc. (C, MATLAB, UML) CPU Platform (UML) I/O IP-XACT Component Characteristics Component Models Model Transformation Exploration and Refinement Analysis VP Generation Virtual Prototype VP Refinement Exploration and Refinement Analysis 4 Modeling techniques providing a holistic system Derivation of an optimized network architecture Generation of abstract executable models (virtual prototypes)

TLM Timing and Platform Model Abstractions Timing abstractions Untimed (UT) Modeling notion of simulation time is not required, each process runs up to the next explicit synchronization point before yielding Loosely Timed (LT) Modeling Simulation time is used, but processes are temporally decoupled from simulation time until it reaches an explicit synchronization point Approximately Timed (AT) Modeling Processes run in lock-step with SystemC simulation time. Annotated delays are implemented using timeouts (wait) or timed event notifications Platform model abstractions 5 CP = Communicating Processes; parallel processes with parallel point-to-point communication CPT = Communicating Processes + Timing PV = Programmers View; scheduled SW-computation and/or scheduled communication PVT = Programmers View + Timing CC = Cycle Callable; cycle count accurate timing behavior for computation and communication

SystemC TLM 2.0 Loosely Timed Modeling Style SystemC (lock-step sync.) wait (1, SC_MS); wait (1, SC_MS); do_communication(); wait (1, SC_MS); SystemC + TLM 2.0 Loosely Timed Modeling Style (LT) local_offset += sc_time(1, SC_MS); local_offset += sc_time(1, SC_MS); do_communication(local_offset); local_offset += sc_time(1, SC_MS); if (local_offset >= local_quantum) { wait (local_offset); local_offset = SC_ZERO_TIME; } Thread 1 Thread 2 Thread 1 Thread 2 Time Quantum 6 advance simulation time

Inaccuracies induced by Temporal Decoupling Core 1 Core 2 Resource parallel accesses to shared resources (cache, bus) conflicts may delay concurrent accesses temporally decoupled simulation (LT) Simulation: Core 1 t=0 t=1 Reality: Core 1 t=0 t=1 t=2 Core 2 Time Quantum t=0 Core 2 higher priority access simulated after lower priority access preemption not detected explicit synchronization entails severe performance penalty alternative approach: early completion with retro-active adjustments t=0 7 Core 1 Core 2 Resource Model Cache

Conflict Resolution in TLM Platforms TLM+ Resource Model access arbitration for each relevant simulation step despite temporal decoupling delayed activation of a core s simulation thread upon conflict arbitration induces no additional context switches in SystemC simulation kernel based on SystemC TLM-2.0 (downward compatible) Universal approach for fast and accurate TLM simulation Arbitration using Resource Model shared by all users of a resource synchronization of bus accesses simulation of parallel RTOS software tasks Task 1 Task 2 OS Core 1 Bus 8 Task 3 Core 2

Simulation-Based Timing Analysis Interpretation of Binary Code Software Simulation Software as Binary SW SW SW SW HW interpreted during simulation Hardware Model HW HW HW HW SW SW System Model HW HW HW HW SW SW HW Software and hardware model separated Independent Compilation HW: RTL model or instruction set simulator Software timing induced by hardware model Problem: Long simulation time Common system model for SW and HW Combined compilation of HW and SW High simulation speed Problem: Precise timing analysis is difficult at source-code level 9

Compilation Back-annotation 10 Goal Source-Level Timing Instrumentation Static timing prediction of basic blocks with dynamic error correction Proposed Approach Compilation into binary code enriched with debugging information Static execution time analysis with respect to architectural details (e.g. pipeline mode, cache model, ) Back-annotation of the analyzed timing information into the original C/C++ source code Advantages Consideration of architectural details Efficient compilation onto simulation host Considering the influences of dynamic timing effects int f( int a, int b, int c, int d ) { int res; res = (a + b) << c d; delay( 3, ms ); return res; } 3 ms 00000000 <f>: <f+0>: add %o0, %o1, %g1 <f+4>: sub %o2, %o3, %o1 <f+8>: retl <f+c>: sll %g1, %o1, %o0 Important: Requires accurate relation between source code and binary code Run-Time Models for Branch Prediction and Caching have to be incorporated

Combined Source-Level Simulation and Target Code Analysis: State of the Art Schnerr, Bringmann et al. [DAC 2008] static pipeline analysis to obtain basic block execution times instrumentation code to determine cache misses dynamically no compiler optimizations Wang, Herkersdorf [DAC 2009]; Bouchhima et al. [ASP-DAC 2009]; Gao, Leupers et al. [CODES+ISSS 2009] use modified compiler backend to emit annotated source code supports compiler optimizations as binary code and annotated source have same structure Lin, Lo, Tsay [ASP-DAC 2010] very similar to approach of [DAC2008] claims to support compiler optimizations, no details 11 Castillo, Villar et al. [GLSVLSI 2010] improves cache simulation method of [DAC2008] supports compiler optimizations without control flow changes

Timing Instrumentation and Platform Integration Cycle Calculation Functions Use an architectural model of the processor for the cycle calculation Architectural Model Function delay Is used for fine granular accumulation of time C code corresponding to a basic block C code corresponding to the cache analysis blocks of the basic block delay(statically predicted number of cycles); delay(cyclecalculationicache(istart,iend)); delay(cyclecalculationforconditionalbranch()); synchronization consume(cycles collected with delay); Update Adjust Sync CPU Cache Model Branch Prediction Model CPU e.g. I/O access Bus Function consume CPU I/O 12 VP synchronization with respect to accumulated delays Virtual Hardware Usage of the Loosely-Timed (LT) Modeling Approach

Compiler Optimizations and the Relation between Source Code and Binary Code Dead Code Elimination binary-level control flow gets simpler no real problem for back-annotation Moving Code (e.g. Loop Invariant Code Motion) not necessarily modifies binary-level control flow blurs relation between binary-level and source-level basic blocks Loop Unrolling complete unrolling is simple (annotate delays in front of loop) partial unrolling requires dynamic delay compensation Function Inlining may induce radical changes in control flow graph introduces ambiguity as several binary-level basic reference identical source locations Complex Loop Optimizations basic block structure may change completely (Loop Unswitching) execution frequency of basic blocks due to transformation of iteration space (Loop Skewing) 13

Effects of Compiler Optimizations Loop Function Code Loop Invariant Transformations Code Inlining Motion 14

Effects of Compiler Optimizations Loop Loop Invariant Transformation Loop Binary Function Code Inlining Unrolling Code Generation Motion 15

Using Debug Information to Relate Source Code and Optimized Binary Code Debug Information Source-Level CFG Binary-Level CFG Compilers usually do not generate accurate debug information for optimized code Structure of source code and binary code can be completely different No 1:1 relation between source-level and binary-level basic blocks Simply annotating delay attributes does not work 16 To perform an accurate source-level simulation without modifying the compiler relation between source code and binary code must be reconstructed from debug information binary-level control must be approximated during source-level simulation

Removing Ambigous Debug Information always executed before loop body Code Motion created additional reference 17 Constructing the dominator homomorphism relation Remaining ambiguities (caused by multiple function inlining) can be resolved dynamically using path simulation code

Generating Annotated Source Code 18 Reconstruct line references Low-level analysis Analyze basic block execution times using proven commercial tool AbsInt ait Instrumentation and back-annotation Add reference markers to original source code Generate path simulation code to determine binary control flow dynamically Path simulation code simulates execution through binary-level control flow graph Control flow reconstruction allows: o o 0x8010 0x8000 0x800C 0x801C Precise consideration of branch penalties, branch prediction model can be included Not matching all basis blocks to a source-level statement without losing information

Reconstruction considers structure of source code and binary Instrumented source code provides functionality Arbitrary properties can be simulated: timing, memory accesses, power, 19

20 Results

Application Example Traffic Sign Recognition ADAS Function: Traffic sign recognition Functional Network & Hardware Architecture Virtual & real Prototypes Image Capturing Region detection Preprocessing Kreiserkennung Circle detection Segmentation CAMERA DISPLAY RECOGNIZE CLASSIFY Classification Feature Extraction Architectural Model Implementation Images Traffic signs Classification Traffic information FlexRay (TLM) CAMERA FlexRay-Ctrl RECO- GNIZE FlexRay-Ctrl Display FlexRay-Ctrl DISPLAY FlexRay-Ctrl CLASSIFY 21

22 Application Example Traffic Sign Recognition

Conclusion 23 Timing analysis for embedded software considering the target software implementation and the influences of the underlying hardware Fast and accurate solution by combining the advantages of formal analysis and simulation Timing relations are annotated to the original source code even though code optimizations have been applied Effects of branch prediction and basic block interleaving are easily supported by considering the basic block transitions in the target code TLM2 platform modeling provides efficient simulation with late timing corrections using the TLM2 resource model TLM2 resource model controls the synchronization of temporal decoupled platform models Cache accesses are optimistically performed and are corrected afterwards (only timing corrections have to be applied, data corrections not needed) Simulation performance is quite similar to native execution of the pure software functionality at the simulation host Highly scalable in terms of the number of processors/processor cores

Thank you very much for your attention! Questions? Oliver Bringmann FZI Forschungszentrum Informatik Microelectronics System Design Email: bringmann@fzi.de 24