System Performance Optimization Methodology for Infineon's 32-Bit Automotive Microcontroller Architecture


Albrecht Mayer, Frank Hellwig
Infineon Technologies, Am Campeon 1-12, 85579 Neubiberg, Germany
albrecht.mayer@infineon.com, frank.hellwig@infineon.com

Abstract

Microcontrollers are the core part of automotive Electronic Control Units (ECUs). A significant investment by ECU manufacturers, and even by their customers, is tied to the specified microcontroller family. To preserve this investment, new generations of the microcontroller must be designed continuously, with hardware and software compatibility but higher system performance and/or lower cost. The challenge for the microcontroller manufacturer is to obtain the relevant inputs for improving system performance, since a microcontroller is used by many customers in many different applications. For Infineon's latest TriCore-based 32-bit microcontroller product line, the required statistical data is gathered with the trace features of the Emulation Device (ED). Infineon's customers use EDs in their unchanged target system and application environment. With an analytical methodology based on this statistical data, the performance improvements of different SoC architecture and implementation options can be quantified. This allows an objective assessment of improvement options by comparing their performance/cost ratios.

1. Introduction

High-end automotive microcontrollers are very powerful devices that are used for many different applications. The peripheral set of a specific microcontroller is adapted to an area such as powertrain (engine control, transmission control, etc.), but it is still flexible and general enough to be used for completely different purposes. From the microcontroller manufacturer's perspective, this means there are many customers and many applications, and even for the same application customers will design their own unique software solution.
Infineon's 32-bit microcontrollers, for instance, allow software to be partitioned between the TriCore and PCP (Peripheral Control Processor) cores. What all customers have in common is that they want new versions of a microcontroller with the same, software-compatible behavior, but with higher performance and lower pricing. The last two requirements are common to all SoC manufacturers, but the first one (software compatibility) results in a different problem class for system optimization than, for example, the design of next-generation wireless baseband chips.

A popular system design methodology is the V-model [1]. In essence it is a top-down design flow combined with a bottom-up validation flow. Except for the first generation of a new microcontroller family, the development of future versions is evolutionary in order to meet the compatibility requirements, while there is high pressure to increase system performance and to reduce cost. This can be described with the F-model: once a new microcontroller generation is available, customers start to use it by creating or adapting software and board designs, and in parallel the microcontroller design team starts to work on the concept and design of the next generation (Figure 1). The challenge is to get the relevant inputs from the customers for improving the system performance of the next generation.

Figure 1: F-Model

There are two important analysis tasks for performance optimization. The first is to get global statistical data about performance-relevant events in the system. This is particularly challenging for hard real-time systems, where most of the processing activities are triggered directly by interrupts or at least depend on real-time data such as converted analog inputs. Only with this information is it possible to select which of those event classes are worth investigating in detail as the second analysis task. System manufacturers then optimize by changing their software, while the microcontroller manufacturer uses this information for quantifiable improvements of the next generation's SoC architecture and implementation.

978-3-9810801-3-1/DATE08 © 2008 EDAA

2. Microcontroller Architecture

Infineon's 32-bit microcontroller product line has recently been extended by the AUDO FUTURE family (TC1797, TC1767) [2]. The TC1797, with 180 MHz CPU frequency and 4 MByte of flash memory, represents the state of the art for high-end automotive microcontrollers. Figure 2 shows a simplified block diagram. A more detailed block diagram can be found in the User's Manual (refer to the predecessor TC1796 until publicly available) [2].

Figure 2: Emulation Device Block Diagram

3. Tooling Solution with Emulation Device

Working together with leading automotive suppliers, Infineon introduced the Emulation Device (ED) concept several years ago, with a dedicated chip for the development phase of automotive customers. An ED consists of the unchanged product chip part extended by several hundred Kbytes of overlay RAM and a powerful trigger and trace unit (Emulation Extension Chip, EEC). The decision for this concept was driven by the requirement for a large overlay RAM for calibration. Calibration is used, for example, to optimize the parameters that determine the characteristics of an engine (torque, exhaust gas, etc.) during the development phase of a car. Having this RAM on every production SoC, or connectible to it, is too expensive and wasteful because of the RAM area or the interface pins, respectively. EDs are offered in the same package as the production SoC and differ only in their slightly higher power consumption when the EEC part is used.

This makes the ED concept very attractive for automotive suppliers, since they can deliver versions of their ECUs with emulation (debug, calibration, measurement) features by essentially just exchanging the microcontroller device. Figure 3 shows, for instance, a TC1767 evaluation board on which a TC1767ED device is mounted without any further change.

Figure 3: TC1767ED on Evaluation Board (1)

Figure 4 shows the block diagram of the AUDO FUTURE family EDs. The product chip part is simplified by omitting all components that are not connected to the EEC part. The EEC consists of the MCDS (Multi-Core Debug Solution) and the Emulation Memory, which is shared between calibration overlay and trace. The components on the EEC part are accessed from a JTAG or DAP (Device Access Port) controlled bus master (ECerberus) over the Back Bone Bus. This access path is just an extension of the product chip JTAG/DAP debug port and requires no additional pins. DAP is a two-pin debug interface that allows a robust high-speed connection over a long cable with minimum external circuitry. It is, however, also possible to access the EEC from the TriCore on the product chip part over the MLI (Micro Link Interface) bridge. This means that in a later development phase a tool can communicate over a user interface like CAN or FlexRay with a monitor routine running on the TriCore, which then accesses the EEC.

(1) The USB connection shown allows entry-level tooling with an on-board JTAG/DAP wiggler. Even USB-only powered operation is possible.

Figure 4: Emulation Device Block Diagram (TC1767/97ED; labeled blocks include DAP/JTAG, Cerberus, IOClient, DMA, MLI0, PCP2, TriCore, Flash PMU, LMB bus, MCDS, and Emulation Memory EMEM with 256/512 KB SRAM on the Back Bone Bus, BBB)

The key part of the EEC required for debugging and performance optimization is the MCDS, a configurable and scalable trigger, trace qualification, and trace compression logic block that is designed for ease of reuse with multiple heterogeneous cores. Integrating the MCDS on the ED and not on the production SoC is an obvious way to save cost. In other applications, where an ED doesn't exist, having the MCDS on the production SoC is possible as well. This is particularly viable for the latest highly integrated silicon technologies [3][4].

Figure 5: MCDS Trace and Trigger Blocks

The MCDS can record the trace of one or several cores in parallel (Figure 5) with scalable time-stamping, conserving the order of events down to cycle level. This allows accurate tracing of concurrency-related bugs, including shared-variable access problems, for the developer to inspect. Adaptation logic allows the MCDS trigger block to be reused with a range of cores, ensuring wide compatibility and effective design and tool reuse. The on-chip multi-master system buses and general system states can also be traced independently of the cores. This is critical when debugging systems based on complex SoCs, because significant activity (e.g. DMA channels) occurs without any of the data passing through a processor core.

Since the on-chip trace memory is limited, it is very important to be able to trigger close to the point of interest. For this purpose the MCDS allows very complex conditions to be defined using Boolean expressions, counters and state machines [5]. It is, for instance, possible to trigger on events not happening in a defined time window.
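A trigger on an event that fails to occur within a time window can be pictured as a down-counter that every observed event reloads. The following Python model is only an illustrative sketch of that idea, not the MCDS configuration interface: the function name `missing_event_triggers`, its parameters, and the choice to reload the counter after firing are all assumptions made for this example.

```python
from typing import Iterable, List


def missing_event_triggers(event_cycles: Iterable[int], window: int,
                           end_cycle: int) -> List[int]:
    """Return the cycles at which a 'missing event' trigger would fire.

    Hypothetical software model: a down-counter starts at `window` and is
    reloaded on every observed event. If `window` consecutive cycles pass
    without an event, the trigger fires and the counter is reloaded.
    """
    events = set(event_cycles)
    fired = []
    remaining = window
    for cycle in range(end_cycle):
        if cycle in events:
            remaining = window      # event seen: restart the time window
            continue
        remaining -= 1              # one more cycle without the event
        if remaining == 0:
            fired.append(cycle)     # window elapsed with no event
            remaining = window
    return fired


if __name__ == "__main__":
    # Events at cycles 3, 6 and 20; the gap after cycle 6 exceeds the window.
    print(missing_event_triggers([3, 6, 20], window=5, end_cycle=25))
```

In hardware the same behavior would be built from a counter and a small state machine; the model only makes the timing relationship between events and the trigger point explicit.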
The two main enhancements to the MCDS from the previous version to the AUDO FUTURE version are the improved cycle-accurate trace and the direct measurement of performance-relevant events. Customers strongly require cycle-accurate trace in order to analyze and optimize the behavior of the system at a very detailed level. The MCDS of AUDO FUTURE now fully supports this mode to the extent possible for a pipelined, multi-scalar, speculative processor, where the point in time at which a specific instruction has actually been executed becomes blurred. The second MCDS improvement is to tap performance-relevant event sources such as cache hits/misses, bus contentions, etc. directly. This allows, for instance, gathering statistical data for such events over the time axis, or directly identifying the root cause of low performance in a specific part of a program. As pointed out earlier, this information is important not only for optimizing the software, but also for the architecture improvements of the next microcontroller generation.

4. Performance Optimization Methodology

One specialty of modern high-end automotive microcontrollers is the large amount of fast embedded flash memory. This embedded flash is used for application code and data and for EEPROM emulation. The main on-chip CPU(s) execute the code directly from the embedded flash. Data structures (constant blocks, look-up tables, etc.) are accessed directly as well. Even though the flash access is very fast in comparison to external flash solutions, a flash access can take several CPU cycles, depending on the CPU frequency. Because of the large number of CPU accesses to the flash (data and code), the path from CPU to flash is the main lever for increasing the CPU system performance in the real application. The challenge is that the behavior of this path is very complex due to code and data caches, the multi-master bus, the pre-fetch buffers for the code and data ports of the flash, and the arbitration between those ports.
Another challenge for the optimization of the microcontroller architecture, already briefly mentioned, lies in the characteristics of the automotive target systems: most of them are hard real-time systems, where the processing activities are triggered by interrupts or at least depend on real-time data such as converted analog inputs. So even if customers provided their software for analysis purposes, this wouldn't help much. In addition, this analysis targets future devices. It therefore has to reflect a future car with future engine control software, which is not available at the point in time when the architecture improvements for new devices are evaluated. Moreover, different customers use the same microcontroller in different ways to solve the same application problem. They do so with a different HW/SW split, with sometimes completely different algorithms, and by using on-chip resources (CPU, PCP, DMA, timer cells, etc.) in different ways. Even for one customer, the application is usually not static. The mapping changes from time to time to realize new features.

With system profiling it is possible to analyze the behavior of the current application software running on existing devices. What would be needed for designing the next generation, however, is the software and system behavior of the future applications (new automotive standards, new HW/SW partitioning, etc.). Since this is not available, the target of architecture enhancements can only be to improve on identified or expected bottlenecks without negative side effects for other possible use cases of the on-chip resources. This is a safe and well-accepted approach since, as already mentioned, the system development is evolutionary. Customers want to reuse their software from the last microcontroller generation unchanged and adapt it only to add new features for their own customers.

5. Enhanced System Profiling Methodology

System Profiling is the analysis of the application software at function level to find out where in the system the performance is consumed and how/why it is consumed.
The "why" can mean, for example, a high CPU data access rate to an on-chip resource or data structure, whereas the "how" can mean, for example, a non-cached access or a cached access with a poor hit rate. System profiling enables the development engineers to identify bottlenecks, hot spots, etc., to weight the identified points, and to concentrate on optimizing those points that show the best ratio of performance gain to development effort. Additionally, system profiling allows the result of the improvement to be measured quantitatively.

For the customer this means optimized hardware usage, identification of hot spots and of data structures/variables that should be mapped to scratch pad memory, optimization of software algorithms for look-up table access, and structuring/mapping of look-up tables. For the SoC architect this means analysis of the application profiles of the different customer applications (different access rates, access localities, and access dependencies due to the different HW/SW mappings) with the target of further optimizing the hardware for future automotive applications.

For this purpose an enhanced System Profiling Methodology was introduced in the new 32-bit automotive microcontrollers from Infineon. This new System Profiling method supports flexible and effective system profiling, which allows dynamic measurement of the system parameters that have an essential influence on the system performance. Essential parameters for the CPU system performance of an engine control system are, for example: data/instruction cache hit/miss rates, CPU data/instruction access rates to flash/SRAM/scratch pad SRAMs, hit rates on the flash read/pre-fetch buffers from the CPU data/instruction side, the CPU IPC (Instructions Per Cycle) rate, the interrupt rate, etc. Besides these, there are also several other parameters for the System Profiling of the PCP, DMA and other resources.
With the new System Profiling method introduced in the AUDO FUTURE devices, all these parameters can be measured dynamically, in parallel, and non-intrusively, with a configurable resolution.

Dynamically, because it is essential to see all parameter values over the time line in order to identify the interesting time spans in which the system performance is not optimal.

All in parallel, because especially in running automotive systems it is usually not possible to repeat the same application run under identical conditions. This means it is not an option to measure different data sources one after the other. In parallel also because only when all these data are available together is it possible to analyze in detail, for example, the reason for a temporarily poor system IPC rate (High cache miss rate? Which cache? Which data or code structure? High interrupt load? And so on).

With configurable resolution and number of measured parameters, in order to first identify the system situation that needs analysis (e.g. poor IPC rate, high access rate, or bad instruction or data cache hit rate) and then continue with a more detailed measurement (more parameters, higher resolution). This is necessary because, especially at high CPU clock frequencies, the bandwidth to the debug tool is a bottleneck.

The new Enhanced System Profiling method requires the hit/miss/IPC/access rates to be generated directly in the on-chip hardware. The ideal platform for the measurement is the ED, with its complex triggers, counters and trace hardware and its comparatively large amount of fast on-chip trace memory.

The IPC rate of the TriCore and the PCP can be measured by using multiple counters of the ED. For each CPU, one MCDS counter measures, for example, the instructions executed (up to 3 within a clock cycle for TriCore), while another counter is used as the resolution basis. Every x clock cycles, where x is the resolution, the number of executed instructions is saved as a trace message in the trace memory, which yields a dynamic IPC rate. It is also possible to connect multiple counter structures with different resolutions: the IPC rate measurement with the high resolution, and therefore also high trace bandwidth, is only activated when the IPC rate measured with the low resolution is below a configurable threshold.

In contrast to the measurement of the IPC rate, all other event rates can be measured with the number of executed instructions as the resolution basis. Example: the fact that an instruction cache miss occurred in clock cycle x is not meaningful information. The information that y cache misses happened in the last 10,000 clock cycles is not meaningful either, as it is not clear whether the CPU mostly executed instructions or stalled because of accesses to high-latency peripherals. Therefore cache miss/hit/access events are measured as rates relative to executed instructions. Example: 4 instruction cache misses during the last 100 executed instructions correspond to an instruction cache hit rate of 96%. 6 CPU data reads from the flash within the last 100 executed instructions are equivalent to a CPU data flash access rate of 6%. For these events, too, it is possible to connect multiple counter structures in order to increase the resolution and/or the number of measured events when a trigger condition (for example, a certain cache hit rate measured with a low resolution is below a threshold) is active.

The sampled rate values are saved in the trace memory of the ED, which acts as a buffer, and are then downloaded from the microcontroller via the JTAG or DAP interface. For a detailed analysis the MCDS allows the code execution and/or selected data accesses to be traced in parallel.
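The counter arrangement described above can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the MCDS implementation: the names `ipc_samples`, `sample_rates` and `RateSample`, and the per-cycle tuple layout, are hypothetical and chosen only for illustration. IPC is sampled every fixed number of clock cycles, while the other event counts are sampled each time a given number of instructions has been executed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


def ipc_samples(retired_per_cycle: Iterable[int], cycles_per_sample: int) -> List[float]:
    """Emit one IPC value every `cycles_per_sample` clock cycles."""
    samples, count = [], 0
    for cycle, retired in enumerate(retired_per_cycle, start=1):
        count += retired
        if cycle % cycles_per_sample == 0:
            samples.append(count / cycles_per_sample)
            count = 0
    return samples


@dataclass
class RateSample:
    instructions: int   # executed instructions in the window (resolution basis)
    misses: int         # instruction cache misses counted in the window
    flash_reads: int    # CPU data reads from flash counted in the window

    @property
    def icache_hit_rate(self) -> float:
        return 1.0 - self.misses / self.instructions

    @property
    def flash_access_rate(self) -> float:
        return self.flash_reads / self.instructions


def sample_rates(events: Iterable[Tuple[int, int, int]], window: int) -> List[RateSample]:
    """Emit one sample whenever at least `window` instructions have executed.

    `events` holds per-cycle tuples (instructions, icache_misses, flash_reads);
    this layout is an assumption made for the example.
    """
    samples = []
    instr = misses = reads = 0
    for retired, miss, read in events:
        instr += retired
        misses += miss
        reads += read
        if instr >= window:
            samples.append(RateSample(instr, misses, reads))
            instr = misses = reads = 0
    return samples


if __name__ == "__main__":
    # 100 instructions containing 4 instruction cache misses and 6 flash data reads.
    trace = [(1, 1 if i < 4 else 0, 1 if i < 6 else 0) for i in range(100)]
    sample = sample_rates(trace, window=100)[0]
    print(f"{sample.icache_hit_rate:.2f} {sample.flash_access_rate:.2f}")
```

Because a core may execute more than one instruction per cycle, a window can overshoot its nominal length slightly, so the model divides by the number of instructions actually counted. A two-level setup as described in the text could run `sample_rates` with a large window and switch to a smaller one once a sample falls below a chosen threshold.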
Another important aspect is the reduced tool interface bandwidth requirement of this new approach: instead of the external tool sampling at least two long counters (executed instructions, measured event, etc.), only a single trace message with the counted events is stored. This is especially important because the bandwidth of the tool interface does not scale with the CPU frequency and because the sizes of on-chip trace memories are limited.

6. Conclusions

The new Enhanced System Profiling approach, introduced with the AUDO FUTURE devices, makes the dynamic system behavior visible by measuring the relevant parameter rates in parallel within the target application. With this approach the software engineer can optimize the usage of the SoC by the application software. The SoC architect can obtain a complete application profile for further SoC optimizations. This allows a quantitative comparison of optimization options in order to choose the ones with the best ratio between performance gain on the one side and development effort and area increase on the other side. The proposed approach is sustainable for increasing clock frequencies and numbers of cores, even with the limited bandwidth of affordable tool interfaces.

7. References

[1] http://v-modell.iabg.de
[2] Infineon's 32-Bit Microcontroller documentation, http://www.infineon.com, Product category: Microcontrollers
[3] A. Mayer, H. Siebert, K. D. McDonald-Maier, Boosting Debugging Support for Complex Systems on Chip. IEEE Computer Magazine, April 2007
[4] A. Mayer, H. Siebert, C. Lipsky, Multi-Core Debug Solution IP. White paper presented by IPextreme, http://www.ip-extreme.com, May 2007
[5] K. Irrgang, J. Braunes, R. Spallek, S. Weisse, T. Gröger, A New Concept for Efficient Use of Complex On-Chip Debug Solutions in SoC Based Systems. Embedded World 2006 Conference, Nürnberg, Germany