Technical Note on NGMP Verification. Next Generation Multipurpose Microprocessor. Contract: 22279/09/NL/JK


Document: NGMP-EVAL-0013
Date: 2010-12-20
Author: Aeroflex Gaisler AB
ESA contract: 22279/09/NL/JK
Deliverable: D11

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Scope of the document
1.2 Overview of design
2 NGMP VERIFICATION
2.1 Limitations
2.2 Operating systems and drivers
2.3 Concurrent SMP and AMP Configurations
2.4 I/O Performance Evaluation
2.5 Memory Segregation Capability
2.6 Multi-core Debugging Support
2.7 Sample benchmarks
3 SUMMARY

1 INTRODUCTION

1.1 Scope of the document

This document is a technical note describing part of the verification performed during the architectural design phase of the Next Generation Multipurpose Microprocessor (NGMP). The NGMP is developed within an activity initiated by the European Space Agency under ESTEC contract 22279/09/NL/JK. The work has been performed by Aeroflex Gaisler AB, Göteborg, Sweden.

1.2 Overview of design

[Figure 1: Overview block diagram]

The system consists of five AHB buses: one 128-bit Processor AHB bus, one 128-bit Memory AHB bus, two 32-bit I/O AHB buses and one 32-bit Debug AHB bus. The Processor AHB bus includes four LEON4 cores connected to a shared L2 cache. The Memory AHB bus is located between the L2 cache and the main external memory interfaces (DDR2 and SDRAM interfaces on shared pins) and includes a memory scrubber.

The I/O bus has been split into two separate buses: all slave interfaces have been placed on one bus (Slave I/O AHB bus) and all master interfaces on the other (Master I/O AHB bus). The Master I/O AHB bus connects to the Processor AHB bus via an AHB/AHB bridge that provides access restriction and address translation (IOMMU) functionality. This AHB/AHB bridge also has a master interface connecting it to the Memory AHB bus. The AHB master interface to use when propagating traffic from a core on the Master I/O AHB bus is dynamically configurable. The two I/O buses include all peripheral units, such as Ethernet MACs and SpaceWire interfaces.

The dedicated 32-bit Debug AHB bus connects one debug support unit (DSU), JTAG, Ethernet, USB and SpaceWire debug links, AHB and trace buffers, and also provides a direct link to the LEON4 statistics unit that contains performance counters. The Debug AHB bus allows for non-intrusive debugging through the DSU and direct access to the complete system.

The target frequency of the processor cores and on-chip buses is 400 MHz, but it ultimately depends on the implementation technology.
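As a rough illustration of what the bus widths above imply, the following sketch computes a theoretical peak data rate per bus at the 400 MHz target frequency. It is an upper bound under a simplifying assumption that is not part of the note: one full-width transfer per clock cycle, with no arbitration, wait states or bridge latency. The dictionary and function names are illustrative only.

```python
# Theoretical peak data rate of each on-chip AHB bus at the target clock.
# Bus widths and the 400 MHz clock come from the design overview.
# ASSUMPTION: one full-width data transfer per clock cycle, ignoring
# arbitration, wait states and bridge latency. Real throughput is lower.

TARGET_HZ = 400_000_000

# Bus name -> data width in bits (from the overview text).
BUSES = {
    "Processor AHB": 128,
    "Memory AHB": 128,
    "Master I/O AHB": 32,
    "Slave I/O AHB": 32,
    "Debug AHB": 32,
}


def peak_gbit_per_s(width_bits: int, clock_hz: int = TARGET_HZ) -> float:
    """Upper bound on the bus data rate in Gb/s (1 Gb = 1e9 bits)."""
    return width_bits * clock_hz / 1e9


if __name__ == "__main__":
    for name, width in BUSES.items():
        rate = peak_gbit_per_s(width)
        print(f"{name:15s} {width:3d}-bit  {rate:5.1f} Gb/s peak")
```

Under this assumption the 128-bit buses peak at four times the rate of the 32-bit buses, which is one reason the placement of I/O masters relative to main memory matters for throughput.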

2 NGMP VERIFICATION

Variations of the NGMP design were implemented on several FPGA prototyping boards. The FPGA boards are off-the-shelf products from Aeroflex Gaisler, Xilinx and Synopsys. The FPGA prototyping approach has been processor driven, based on the LEON3/4 software environment. This facilitates re-use of existing resources and ensures that the overall objective of a processor-controlled device is achieved. The software development tools are based on the well-known LEON cross compilers, developed by Aeroflex Gaisler. The GRMON debug monitor was used for communication between the host system and the target system.

2.1 Limitations

Since off-the-shelf development boards were used for FPGA prototyping, it was expected that no board could be found that allowed all interfaces of the NGMP design to be used. In addition, the NGMP design requires a large FPGA if the full design is to be implemented. The clock frequencies of the FPGA prototypes were also lower than the target frequency of the final chip. Other deviations from the final ASIC design include:

- Macros, such as the PLL and DDR2 PHY.
- DDR2 SDRAM and SDR SDRAM on shared pins could not be prototyped on FPGA.
- Several of the core buffers will be implemented with flip-flops in the final design. On FPGA this may lead to the design growing too large, so RAM blocks may be used instead.
- The SERDES (HSSL) link cannot be verified.

2.2 Operating systems and drivers

Test suites and custom test applications were run to demonstrate the hardware and to validate the porting work made to the operating systems. RTEMS 4.10, eCos 2.0, VxWorks 6.7 and Linux 2.6 were demonstrated to be operating correctly. Tests were also run with the MKPROM2 bootloader creation tool.

2.3 Concurrent SMP and AMP Configurations

The NGMP system has been designed with extended support for running AMP configurations. The FPGA prototyping included two different tests, using two different operating systems: RTEMS and Linux. These tests demonstrate:

- SMP system: Linux SMP on three CPUs
- Heterogeneous AMP system: Linux on three CPUs and RTEMS on one CPU
- Homogeneous AMP system: RTEMS on two CPUs

The two tests successfully demonstrated how resources of the NGMP design can be shared in a heterogeneous AMP system, in an SMP system and in a homogeneous AMP system. They also demonstrated the flexibility of the extended multiprocessor interrupt controller: different CPUs can each be routed to a unique interrupt controller, and multiple CPUs can be routed to the same interrupt controller.

2.4 I/O Performance Evaluation

Performance and throughput measurements are typically performed on FPGA prototypes due to the prohibitively long simulation runs required to measure throughput. The results from FPGA prototyping are then scaled if the FPGA prototype is not representative of the final design.

For NGMP, there are some difficulties involved in scaling the results from FPGA prototypes. Due to the placement of masters and main memory in the NGMP bus topology, masters on the Master I/O bus experience significant latencies on accesses made to main memory. In a typical LEON system, the AHB masters are placed on the same bus as the processors and main memory. In the NGMP system, masters must perform accesses over the IOMMU and possibly over the L2 cache in order to fetch data from main memory. Traversing the IOMMU and L2 cache adds latency cycles to each single access or to each block of burst accesses. These latency cycles, combined with latency from external memory, are the main limitation to I/O throughput (apart from limitations in the bus fabric itself: arbitration cycles, data bus width, etc.).

On a prototype system with a system frequency of 50 MHz, the same number of latency clock cycles gives a latency time that is eight times higher than in the final 400 MHz system. This is acceptable if the input traffic can be scaled to be eight times slower, but that is not always possible. One example applicable to the NGMP system is gigabit Ethernet. To test I/O throughput, one of the test cases would typically involve transferring data with TCP/IP. When the system cannot handle the stream of data, the Ethernet controller will experience overruns and packets will be dropped. This leads to packets being re-sent, possibly at a lower rate. As soon as packets are being re-sent, it becomes difficult to scale the results.

As the NGMP FPGA prototypes may not be representative for throughput tests, simulations were performed in order to build a view of the design's performance when running at the target frequency. Traffic was generated via the SpaceWire router and Ethernet cores, and tests were run on several configurations, including: L2 cache disabled, L2 cache enabled, and L2 cache with fault-tolerance enabled. The results are summarised in Table 1.
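The frequency-scaling argument in this section can be made concrete with a small calculation. The sketch below is illustrative only: the 50 MHz and 400 MHz clocks come from the text, while the latency cycle count is a hypothetical placeholder, since the note does not give cycle counts. All function names are assumptions made for this example.

```python
# Sketch of the frequency-scaling argument: a fixed number of latency
# cycles takes longer in wall-clock time on a slower prototype.

PROTOTYPE_HZ = 50_000_000   # FPGA prototype clock, from the text
TARGET_HZ = 400_000_000     # target clock of the final device, from the text

# HYPOTHETICAL value for illustration only; the note gives no cycle counts.
EXAMPLE_LATENCY_CYCLES = 20


def latency_ns(cycles: int, clock_hz: int) -> float:
    """Wall-clock duration of `cycles` clock cycles, in nanoseconds."""
    return cycles * 1e9 / clock_hz


def slowdown(target_hz: int, prototype_hz: int) -> float:
    """How many times slower the prototype clock is than the target clock."""
    return target_hz / prototype_hz


def scaled_input_rate(line_rate_bps: float, factor: float) -> float:
    """Input rate the prototype would need for results to scale linearly."""
    return line_rate_bps / factor


if __name__ == "__main__":
    k = slowdown(TARGET_HZ, PROTOTYPE_HZ)
    proto = latency_ns(EXAMPLE_LATENCY_CYCLES, PROTOTYPE_HZ)
    final = latency_ns(EXAMPLE_LATENCY_CYCLES, TARGET_HZ)
    print(f"slowdown factor:      {k:g}")
    print(f"latency on prototype: {proto:g} ns")
    print(f"latency on target:    {final:g} ns")
    # Gigabit Ethernet would have to be slowed to this rate on the prototype.
    print(f"scaled GbE input:     {scaled_input_rate(1e9, k) / 1e6:g} Mb/s")
```

Dividing a 1 Gb/s line rate by eight gives 125 Mb/s, which is not one of the standard 10/100/1000 Mb/s Ethernet rates. This is one way to see why the input traffic cannot always be slowed down to match the prototype.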
In Table 1, the first column lists the configuration. The second column (1x Eth) lists the results from running the GRETH_GBIT throughput test on the first Ethernet controller. The 2x Eth columns list the results, per Ethernet core, from running the GRETH_GBIT throughput test on both controllers. The SpW columns show the results from running the SpaceWire router throughput test, given per AMBA port and as a total for all AMBA ports. The columns under Combined test show the results when running all tests simultaneously.

                      1x Eth     2x Eth                SpW                    Combined test
Configuration                    Eth 0      Eth 1      Per port   Total       Eth 0      Eth 1      SpW/port   SpW total
L2 cache disabled     1.2 Gb/s   730 Mb/s   790 Mb/s   394 Mb/s   1.57 Gb/s   438 Mb/s   480 Mb/s   216 Mb/s   865 Mb/s
L2 cache enabled      1.7 Gb/s   1.7 Gb/s   1.7 Gb/s   1.56 Gb/s  6.25 Gb/s   1.4 Gb/s   1.5 Gb/s   1 Gb/s     4 Gb/s
L2 cache FT enabled   1.7 Gb/s   1.7 Gb/s   1.7 Gb/s   1.5 Gb/s   6.1 Gb/s    1.4 Gb/s   1.4 Gb/s   1.5 Gb/s   3.9 Gb/s

Table 1: Throughput of 400 MHz system

2.5 Memory Segregation Capability

The NGMP design provides separation capabilities via processor memory management units (MMUs) and an I/O Memory Management Unit (IOMMU). These capabilities were demonstrated with a test consisting of two RTEMS images running in parallel, each trying to access allowed and protected memory areas using its own CPU (load/store instructions). Attempts to access protected areas were also made using RMAP commands to the SpaceWire router's AMBA ports. The attempts to access protected areas were effectively blocked by the processor MMUs and the IOMMU.

2.6 Multi-core Debugging Support

To demonstrate multi-core debugging support, Aeroflex Gaisler's debug monitor GRMON is used to control and view the state of multiple CPUs.

Four RTEMS images, one per CPU, are loaded to memory. Initialisation of the individual CPU stack pointers and entry points is demonstrated using GRMON commands from a batch script. Once initialised and booted, the CPUs communicate with each other using shared memory. Each CPU spins in a tight busy loop waiting for a message to be passed on to the next CPU. The test demonstrated:

- Stack pointer per CPU
- Entry point per CPU
- Viewing register file per CPU
- Viewing instruction trace per CPU
- Viewing bus transactions per CPU
- Instruction and bus transactions are sampled using the same counter, making it possible to determine their timing relative to each other
- Hardware watch points per CPU
- Hardware break points per CPU
- Continuing execution of all CPUs
- A practical example of how to observe D-cache snooping in action by inspecting the instruction traces of multiple CPUs

2.7 Sample benchmarks

To compare the performance of the NGMP to previous LEON2 and LEON3 systems, a small collection of benchmarks has been developed. These benchmarks can be compiled with the BCC tool-chain and run on systems without an OS and MMU. While not providing an exhaustive performance profile, these benchmarks still provide interesting comparison points in the development of the LEON processor.

The benchmarks have been run on the following systems: AT697, UT699, GR712RC and NGMP. The systems have the following processor configurations:

- AT697: LEON2FT, 32K + 16K cache, 5-clock MUL, load delay 1, Meiko FPU
- UT699: LEON3FT V1, 8K + 8K cache, 5-clock MUL, load delay 2, GRFPU
- GR712RC: LEON3FT V2, 16K + 16K cache, 5-clock MUL, load delay 1, GRFPU, branch prediction
- NGMP: LEON4, 16K + 16K L1 cache, 2-clock MUL, load delay 1, GRFPU, 128K L2 cache

The following benchmarks were run:

- 164.gzip (from the SPEC CPU2000 suite)
- 176.gcc (from the SPEC CPU2000 suite)
- 256.bzip2 (from the SPEC CPU2000 suite)
- AOCS benchmark
- Basicmath_large
- Coremark-1.0
- Dhrystone-2.0
- Linpack-DP
- Whetstone

All benchmarks have been compiled with gcc-4.4.2 tuned for SPARC V8. All systems were clocked at 50 MHz during the tests, using 32-bit SDRAM (LEON2/3) or 64-bit DDR2 (NGMP). Table 2 shows the performance figures relative to AT697.

Benchmark                 AT697   UT699   GR712RC   NGMP
164.gzip                  1       0.94    1.1       1.31
176.gcc                   1       0.79    0.97      1.3
256.bzip2                 1       0.93    1.06      1.33
AOCS                      1       1.2     1.52      1.79
Basicmath                 1       1.3     1.46      1.62
Coremark, 1 thread        1       0.89    1.09      1.21
Coremark, 4 threads       1       0.89    2.05      4.59
Dhrystone                 1       0.94    1.05      1.39
Dhrystone, 4 instances    1       0.94    1.61      4.81
Linpack                   1       1.2     1.26      1.71
Whetstone                 1       1.94    2         2.22
Whetstone, 4 instances    1       1.94    3.7       8.68

Table 2: Performance comparison

The table shows that the LEON4/NGMP system has approximately 30% better CPI than AT697 on integer benchmarks, and up to 100% better CPI on floating-point benchmarks. The Coremark benchmark can also be run multi-threaded, which is reflected in the high 4-thread results for GR712RC and NGMP. The benchmark fits in the L1 cache and therefore scales almost linearly with the number of cores. All benchmarks were run using the BCC runtime. Using the Linux SMP OS, multiple instances of Dhrystone and Whetstone were run on GR712RC and NGMP. The results show that performance scales better on NGMP than on GR712RC, mostly due to wider data-paths and the presence of an L2 cache.

3 SUMMARY

The system has been verified by means of VHDL simulation and FPGA prototyping. The objective during the architectural design phase has been to verify the processing capabilities of the NGMP system. The system-level tests of fault-tolerance capabilities have been deferred to the Preliminary SEE validation to be performed during the detailed design phase. Preliminary SEE validation is foreseen to include running the same set of tests, but also with error injection enabled.

Test results show that porting of all operating systems and driver development has been successful. Performance measurements indicate that the NGMP design is within specification. Providing enough bandwidth to satisfy four 6.25 Gb/s high-speed serial links will be challenging, however.
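As a cross-check of the multi-core scaling figures in Table 2, the sketch below derives a speedup (four-way result divided by single result) and a per-core efficiency for GR712RC and NGMP. The figures are copied from Table 2. The core counts are an input to the calculation: four for NGMP as described in this note, and two for the GR712RC, which is a dual-core device. Note that the multi-instance Dhrystone and Whetstone runs used Linux SMP, so these ratios mix runtimes and should be read as indicative only. The names used here are illustrative.

```python
# Multi-core scaling derived from Table 2 (relative performance, AT697 = 1).
# Each entry is (single thread/instance, four threads/instances).
TABLE2 = {
    "Coremark":  {"GR712RC": (1.09, 2.05), "NGMP": (1.21, 4.59)},
    "Dhrystone": {"GR712RC": (1.05, 1.61), "NGMP": (1.39, 4.81)},
    "Whetstone": {"GR712RC": (2.0, 3.7),   "NGMP": (2.22, 8.68)},
}

# Number of processor cores per device (GR712RC is dual-core, NGMP quad-core).
CORES = {"GR712RC": 2, "NGMP": 4}


def speedup(system: str, benchmark: str) -> float:
    """Ratio of the four-way result to the single result."""
    single, multi = TABLE2[benchmark][system]
    return multi / single


def efficiency(system: str, benchmark: str) -> float:
    """Speedup divided by the number of cores (1.0 means linear scaling)."""
    return speedup(system, benchmark) / CORES[system]


if __name__ == "__main__":
    for bench in TABLE2:
        for system in CORES:
            s = speedup(system, bench)
            e = efficiency(system, bench)
            print(f"{bench:10s} {system:8s} speedup {s:4.2f}  efficiency {e:4.2f}")
```

In this calculation the NGMP efficiency is higher than the GR712RC efficiency for all three benchmarks, which agrees with the observation that performance scales better on NGMP.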