Mastering the Behavior of Multi-Core Systems to Match Avionics Requirements
Hicham AGROU, Marc GATTI, Pascal SAINRAT, Patrice TOILLON
{hicham.agrou, marc-j.gatti, patrice.toillon}@fr.thalesgroup.com
{agrou, sainrat}@irit.fr
www.thalesgroup.com
Summary
  Introduction
  Multi-Core Architectures State of the Art
    Academic Processor, COTS Architecture
  Evaluation of QorIQ Platform (P4080) from Freescale
    Procedures
    Results
  THALES Avionics AMISIS
    Concept
    First Performance Results
INTRODUCTION
Evolution of Avionic Embedded Computers
Pre-IMA Generation
  70s: Concorde; 1 unit = 1 function; analog only
  80s: A300/A320/B737; 1 unit = 1 function; Intel, 68010
  90s: A330/A340; 1 unit = 1 function; Intel, DSP
IMA Generation
  ~1995: B777; 2 to 3 functions/unit; AMD29050; 1st generation of IMA
  2000/2010: A380/B787; 3 to 5 functions/unit; PowerPC, A653 + RTOS; generalization of IMA; PowerPC ISA's legacy
  Next generation: IMA on multicore? 10 or more functions/unit
Current Avionics Requirements
Safety, Certification, Determinism
  Level  Failure Condition  Failure Rate
  A      Catastrophic       < 1 in 10^9 hours of flight
  B      Hazardous          < 1 in 10^7 hours of flight
  C      Major              < 1 in 10^5 hours of flight
  D      Minor              < 1 in 10^3 hours of flight
  E      No Effect
Partitioning: Spatial and Temporal (Time & Space isolation)
Software stack, from top to bottom:
  Application #1, Application #2, ..., Application #N
  Application Programming Interface (ARINC 653): communication and synchronisation services; time, fault, and task management
  Operating System Layer (ARINC 653): partition scheduling
  Package Driver
  Processor Hardware
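Temporal partitioning as listed above is commonly realised with a static, cyclic table of time windows. The following is a minimal sketch of that idea, not the ARINC 653 API: the `Window` type, the partition names, the 100 ms major frame, and the window lengths are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    partition: str    # hypothetical partition name
    offset_ms: int    # start of the window inside the major frame
    duration_ms: int  # length of the window


def validate_schedule(windows, major_frame_ms):
    """Return True if the windows do not overlap and fit in the major frame."""
    end = 0
    for w in sorted(windows, key=lambda w: w.offset_ms):
        if w.duration_ms <= 0 or w.offset_ms < end:
            return False
        end = w.offset_ms + w.duration_ms
    return end <= major_frame_ms


def active_partition(windows, major_frame_ms, t_ms):
    """Return the partition that owns the processor at time t_ms, or None if idle."""
    phase = t_ms % major_frame_ms
    for w in windows:
        if w.offset_ms <= phase < w.offset_ms + w.duration_ms:
            return w.partition
    return None


# Illustrative schedule (assumed values, not taken from the slides).
MAJOR_FRAME_MS = 100
SCHEDULE = [
    Window("application_1", 0, 20),
    Window("application_2", 20, 30),
    Window("application_N", 60, 40),
]
```

Because the table is fixed offline, the partition running at any instant depends only on the time modulo the major frame, which is what makes the temporal behavior analysable on a single core.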
Future Avionics Requirements
Increase Performance
  Host more functions per unit
  Improve the performance/watt ratio
Reduce Environmental Footprint
  Less energy consumption
  Reduce number of units
Smaller Modules
  More embedded functions per chip
MULTI-CORE seems to be the solution
MULTI-CORE ARCHITECTURES: STATE OF THE ART
Academic & COTS Architectures
Multi-Core Architectures: State of the Art
Academic predictable multi-core processors: MERASA & PRET processors
COTS architectures: IBM Cell, Freescale's MPC8641D & QorIQ Platform
Local memories (scratchpads & caches): best cache policy (for analyzability), cache analysis (optimization to reduce cache pollution), shared cache strategy to reduce interferences
Interconnect element: shared bus (bounding access time), ring protocols, CoreNet, & Data Path Accelerator Architecture
  Lack of studies of multi-core architectures in avionics
  Focus on core-level & local memory evolutions
  At interconnect level, no partitioning guarantee is given
Our approach is to focus on a smart interconnect that manages all transactions in a multi-core system
Reference: Hicham Agrou, Marc Gatti, Pascal Sainrat, Patrice Toillon. A Design Approach for Predictable and Efficient Multi-Core for Avionics. In: Digital Avionics Systems Conference (DASC 2011), Seattle, 16/10/2011-20/10/2011, Vol. 7D3, IEEE, p. 1-11; October 2011.
EVALUATION OF A QORIQ PLATFORM (P4080)
Procedure & Results
Evaluation of a QorIQ Platform (P4080) from Freescale
Objective: definition of usage profiles compatible with the temporal aspects of avionics constraints
Procedure: find the thresholds of transaction density beyond which CoreNet(TM) introduces degraded performance and/or abnormal behavior
[Figure: software configurations on hardware (HW): SMP configuration, AMP or AMP+SMP configuration, bare-metal configuration. Labels: applications A(0)..A(n) and A(0,1)..A(x,n); OS(0)..OS(x); OS or hypervisor(s); stress application; "Our Procedure".]
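The threshold search stated in the procedure can be sketched as follows. This is only an illustration under assumptions: "transaction density" is represented by an arbitrary number (for example the count of active initiators), `find_density_threshold` is a hypothetical helper, and the latency bound is whatever the usage profile tolerates. No measured values from the study are reproduced here.

```python
def find_density_threshold(samples, max_latency_ns):
    """Return the lowest density whose worst observed latency exceeds the bound.

    samples: iterable of (density, worst_latency_ns) pairs.
    Returns None when every sample stays within the bound.
    """
    for density, worst_latency_ns in sorted(samples):
        if worst_latency_ns > max_latency_ns:
            return density
    return None


# Made-up numbers, for illustration only (not results from the slides).
example = [(1, 100), (2, 110), (3, 400), (4, 120)]
threshold = find_density_threshold(example, max_latency_ns=200)
```

The sketch reports the first density at which the bound is violated even if a higher density happens to behave better, since a usage profile has to stay below the first point of abnormal behavior.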
P4080 Processor: Test Perimeter
P4080 DS
  Cores: 1.2 GHz
  DDR3 #1: 1200 MHz
  CoreNet: 600 MHz
Procedure
The witness core performs a transaction and measures its duration
Procedure > The Implemented Transaction Initiators
  DMA controllers
  Flooding cores
  Each measure is stored in AREA 1 of DDR3
Test Perimeter > The Implemented Memories
Procedure
  Step 1: Platform's initialisation
  Step 2: Transactions of the flooding cores
  Step 3: Direct memory accesses
  Step 4: Transaction of the witness
  Step 5: Storage of the transaction duration
Steps 2 to 5 are repeated with 0, 1, 2, 3, 4, 5, 6, and 7 flooding cores in step 2
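The structure of this measurement campaign can be sketched as a host-level harness. This is not the bare-metal code run on the P4080: `run_campaign`, `witness_transaction`, `start_flooding`, and `stop_flooding` are hypothetical names, the DMA step is left out, and timing uses Python's `time.perf_counter_ns` purely for illustration.

```python
import time


def run_campaign(witness_transaction, start_flooding, stop_flooding,
                 sizes, max_flooding_cores=7, repetitions=3):
    """Measure witness transaction durations for 0..max_flooding_cores flooders.

    Returns a dict mapping (n_flooding, size) to a list of durations in ns.
    """
    results = {}
    for n_flooding in range(max_flooding_cores + 1):
        start_flooding(n_flooding)  # step 2 (step 3, DMA, omitted in this sketch)
        try:
            for size in sizes:
                durations = []
                for _ in range(repetitions):
                    t0 = time.perf_counter_ns()
                    witness_transaction(size)  # step 4
                    durations.append(time.perf_counter_ns() - t0)
                results[(n_flooding, size)] = durations  # step 5
        finally:
            stop_flooding()
    return results


if __name__ == "__main__":
    # Stand-in callables: a buffer allocation plays the role of the transaction.
    demo = run_campaign(lambda size: bytearray(size),
                        lambda n: None, lambda: None, sizes=[8, 64])
    print(len(demo), "configurations measured")
```

Keeping the measurement storage (step 5) separate from the timed region matters here, because the later scenarios show that where and when the measure is stored can itself disturb the platform.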
Scenario 1
Similar results when testing:
  With no/1/2 DMA controller(s)
  1 to 8 active cores
Parameters:
  512 transaction sizes (witness core) have been tested
  ~16 hours of test for each scenario
  Each test is repeated ~10 000 times
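A rough order of magnitude for the campaign size can be worked out from these parameters. The calculation below assumes that every one of the 512 sizes is repeated 10 000 times for each of 8 core configurations within one 16-hour scenario; the slides do not state how the repetitions are distributed, so `CORE_CONFIGS` and the resulting figures are assumptions.

```python
SIZES = 512                # transaction sizes tested (from the slide)
REPETITIONS = 10_000       # approximate repetitions per test (from the slide)
CORE_CONFIGS = 8           # assumption: one run per count of active cores (1 to 8)
TEST_SECONDS = 16 * 3600   # approximate duration of one scenario (from the slide)

measurements = SIZES * REPETITIONS * CORE_CONFIGS
avg_ms_per_measurement = TEST_SECONDS / measurements * 1000

print(f"{measurements:,} measurements, ~{avg_ms_per_measurement:.2f} ms each")
```

Under these assumptions a scenario collects about 41 million measurements, or roughly 1.4 ms of wall-clock time per measurement including setup and storage.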
Scenario 2
Scenario 1 & 2 > Results
[Figures: results of Scenario 1 and Scenario 2]
These results show that DMAs in DDR3 (memory controller 1) can increase the transaction latency of the witness core in CPC 1
Scenario 3 DMA Load & Store transactions
Scenario 3 > Results
Several transaction durations of the 7-core configuration are not saved
Scenario 4 All transaction initiators perform their transactions in DDR3 (AREA 2)
Scenario 4 > Results
All transaction durations are saved in DDR3 (memory controller 1) and their values increase
Scenario 5 All transaction initiators perform their transactions in DDR3's AREA 1
Scenario 5 > Results
Several transaction durations are not saved in DDR3 and the saved values have globally decreased
Scenario 6: Delaying the measure storage
Delays applied before storing each measure: 0 µs, 3.3 µs, 4.3 µs, 5 µs
Delaying the measure storage into AREA 1 (DDR3 of memory controller 1) makes the phenomenon disappear
THALES AVIONICS AMISIS
Concept, Features & First Performance Results
AMISIS Architecture
Avionics Multi-core Interconnect for Scalable Integrated System
Concept
  Mastering the temporal and spatial behavior of each transaction initiator
  Ensuring that each transaction initiator respects an insertion contract
  Implementing hardware services
[Figure: transaction delay bounds along a time axis t. Labels: Maximal Transaction Delay Measure (Max-TDM), Maximal Technology Transaction Delay, Minimal Technology Transaction Delay, Minimal Transaction Delay Measure (Min-TDM). Panels: COTS black-box approach, custom white-box approach.]
Experimentation's Implementation
  Objective: definition of the temporal impacts of AMISIS on transaction durations
  Procedure: design of AMISIS units in FPGA; measure of the temporal impacts of the AMISIS units
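One way to picture the Max-TDM and Min-TDM services is a monitor that tracks the extreme transaction delays of an initiator and checks each delay against a window. The slides do not define the insertion contract, so modelling it as a pair of delay bounds is an assumption, and `DelayMonitor` with its fields is a hypothetical software stand-in for what AMISIS implements in hardware.

```python
class DelayMonitor:
    """Track min/max observed transaction delays for one initiator."""

    def __init__(self, min_allowed_ns, max_allowed_ns):
        # Assumed form of an insertion contract: an allowed delay window.
        self.min_allowed_ns = min_allowed_ns
        self.max_allowed_ns = max_allowed_ns
        self.min_tdm = None  # smallest delay observed so far
        self.max_tdm = None  # largest delay observed so far
        self.violations = 0

    def record(self, delay_ns):
        """Record one delay; return True if it lies inside the window."""
        if self.min_tdm is None or delay_ns < self.min_tdm:
            self.min_tdm = delay_ns
        if self.max_tdm is None or delay_ns > self.max_tdm:
            self.max_tdm = delay_ns
        ok = self.min_allowed_ns <= delay_ns <= self.max_allowed_ns
        if not ok:
            self.violations += 1
        return ok


# Illustrative use with made-up delays (not measurements from the study).
monitor = DelayMonitor(min_allowed_ns=50, max_allowed_ns=200)
outcomes = [monitor.record(d) for d in (80, 120, 250, 60)]
```

With a black-box COTS part only the measured extremes are available, which is why the sketch keeps the observed values separate from the allowed bounds.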
First Performance Results of AMISIS
AMISIS: 125 MHz; Memory: 400 MHz
[Figures: timing of Load 08, Load 16, and Load 32 transactions; annotations: 1 cycle, 2 cycles]
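To put the annotated cycle counts in time units, a cycle count can be converted with the clock frequency. The slides do not say in which clock domain the 1-cycle and 2-cycle annotations are counted, so the sketch below simply evaluates both stated clocks; `cycles_to_ns` is a hypothetical helper.

```python
def cycles_to_ns(cycles, freq_mhz):
    """Convert a number of clock cycles to nanoseconds at the given frequency."""
    return cycles * 1000.0 / freq_mhz


AMISIS_MHZ = 125   # from the slide
MEMORY_MHZ = 400   # from the slide

for cycles in (1, 2):
    print(cycles, "cycle(s):",
          cycles_to_ns(cycles, AMISIS_MHZ), "ns at 125 MHz,",
          cycles_to_ns(cycles, MEMORY_MHZ), "ns at 400 MHz")
```

If the cycles are AMISIS cycles, the annotations correspond to 8 ns and 16 ns; if they are memory cycles, to 2.5 ns and 5 ns.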
CONCLUSION
Conclusion
Processing elements use embedded hardware features that worsen Worst Case Execution Time estimation
  Mastered:
    Dynamic branch prediction
    Out-of-order and speculative execution
    Larger pipelines
    More ways
    Replacement policies: PLRU, FIFO, round robin
Multi-core context implies concurrent accesses that require management to ensure proper spatial and temporal partitioning
  Scenario 2 shows that increasing the number of shared resources sometimes increases transaction durations
  Scenarios 3 and 5 show that unexpected phenomena may appear
First performance results of THALES Avionics AMISIS show that the temporal impact of its controls is negligible
Thank You for Your Attention