COMMUNICATION AND I/O ARCHITECTURES FOR HIGHLY INTEGRATED MPSoC PLATFORMS


1 COMMUNICATION AND I/O ARCHITECTURES FOR HIGHLY INTEGRATED MPSoC PLATFORMS

Martino Ruggiero, Luca Benini (University of Bologna)
Simone Medardoni, Davide Bertozzi (University of Ferrara)
In cooperation with STMicroelectronics

OUTLINE
- Overview of industrial state-of-the-art set-top-box platforms
  - Segmented communication architecture
  - Off-chip SDRAM memory controller
- Crossbenchmarking of communication architectures
  - Single-layer architecture
    - Many-to-many traffic pattern
    - Many-to-one traffic pattern
  - Multi-layer architecture
    - Centralized high latency slave bottleneck
    - Faster on-chip shared memory
- Conclusions
- Hints for future work

2 State-of-the-art set-top-box industrial platforms

[Figure: platform block diagram (LX, IP 1, IP 2, IP 5)]

Segmented communication architecture
- Bridge performance is critical for the system
  - Protocol conversion/adapter
  - Frequency and size conversion
  - Non-blocking behaviour for the injecting bus
  - Ability to handle multiple outstanding transactions

State-of-the-art set-top-box industrial platforms

[Figure: platform block diagram (IP 1, LX, IP 2, IP 5)]

- Many platforms tend to have a global performance bottleneck: the memory controller for the off-chip SDRAM
  - DRAM integration is costly
  - A large processing data footprint requires large memories
- What is the relation between the communication architecture and the memory architecture?

3 Virtual platform

SystemC based environment for functional simulation

[Figure: virtual platform block diagram: cores (ARM7, ST220, ST231, others), interrupt controller, private memories (PRI MEM 1 ... PRI MEM N), interconnection (STBus, AMBA MultiLayer, AMBA AXI, Xpipes), off-chip SDRAM memory controller, DMA engine, shared memory, semaphores]

- Modelling accuracy emphasized
  - Cycle-accurate and bus signal-accurate
  - Processor cores modelled at the level of their instruction set (IS)
  - Simulation speed: kcycles/s (6 cores on a 2.2 GHz P4)

MPSIM extensions

[Figure: platform block diagram (IP 1, LX, IP 2, IP 5)]

- Buffers and size/frequency converters for AHB-AHB, AXI-AXI and STBus-STBus
- Protocol converters: AHB-AXI, AHB-STBus, AXI-STBus
- Modelling of bridge latencies
- Traffic generators
  - Either native bus IF or wrappers with back-annotated latencies
- SystemC modelling and validation (memory controller, SDRAM, DDR SDRAM)

4 Crossbenchmarking

[Figure: N LX cores connected through the communication architecture to memories Mem1 ... MemN]
[Figure: protocols compared: AMBA high-speed bus (CPU, EU, IO, Mem); AXI (master/slave; address, write, read and write response channels); STBus (initiator/target; request and response channels)]

Bus performance

[Figure: overall execution time, Matdep benchmark, 0% to 300%, vs. number of processors; series include ST and AXI; plot annotations along the number-of-processors axis: "STBus shows better performance", "AXI shows better performance"]

- AXI performs slightly worse than AHB
- AHB and STBus show similar performance

5 Transaction latency

[Figure: single read transaction latency (ns), Matdep benchmark; series include ST and AXI]
[Figure: bus busy (%), Matdep benchmark; series include ST and AXI]

- AXI incurs higher transaction latency
  - Poor performance with low bus traffic
- AXI scales better with increasing levels of bus congestion
  - More complex arbiter and 5 independent channels
- 80% bus busy can be considered the performance crossing point of AXI

Fine-grain protocol analysis

[Figure: protocol timing comparison with a 2 wait states memory; cases include STBus (low buffering), STBus (high buffering) and AXI; behaviour allowed by protocol features]

Observations, in the order shown on the slide:
- Arbitration and slave response latency cannot be hidden
- One new request is processed while a response is in progress
- More requests are processed while a response is in progress
- Transfers are interleaved on the internal data lanes

6 Single slave bottleneck

[Figure: traffic generators TG1 ... TGN connected through the communication architecture to a single slave]

Execution time with single slave (on-chip shared memory, 1 wait state memory)

[Figure: overall execution time (clock cycles) for AXI and STBus platforms; series: AXI, ST LRU FIFO, ST LRU FIFO 2 2, ST LRU FIFO 2 1, ST LRU FIFO 1 2, ST LRU FIFO 1 1, ST MSG_BSD FIFO, ST MSG_BSD FIFO 1 1]

- AXI performs worse than AHB and STBus (LRU)
- Message-based arbitration degrades performance
- Performance is sensitive to the direct datapath FIFO depth
- The most that can be expected is the same performance for each bus
- A centralized slave bottleneck is the best operating condition for AHB
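The slide compares LRU and message-based arbitration but does not say how the STBus node implements either. As a rough illustration of what "LRU" means for a bus arbiter, the sketch below grants the requesting port that was served least recently. The class name `LruArbiter` and its interface are hypothetical and are not taken from the STBus specification.

```python
class LruArbiter:
    """Generic least-recently-used arbiter (illustrative, not the STBus implementation)."""

    def __init__(self, num_ports):
        # Front of the list = port granted least recently.
        self.order = list(range(num_ports))

    def grant(self, requests):
        """requests: set of port indices asking for the bus. Returns a port or None."""
        for port in self.order:
            if port in requests:
                # The winner becomes the most recently used port.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None  # nobody is requesting


arb = LruArbiter(3)
grants = [arb.grant({0, 1, 2}) for _ in range(4)]  # all ports always requesting
late = arb.grant({0, 2})                           # port 1 is idle this cycle
```

With every port requesting, the grants rotate fairly. When a port stays idle, the port served longest ago among the requesters wins.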


7 FIFO-size dependent STBus behaviour

- IN FIFO 1: 1 cycle latency for grant propagation
- IN FIFO 2: next transfer readily initiated
  - Advance sampling of next transaction

Platform level centralized slave bottleneck

[Figure: platform block diagram (IP 1, LX, IP 2, IP 5)]

- Full STBus, AHB and AXI platforms
- However, the comparison is not fair:
  - AXI masters do not support multiple outstanding transactions
  - The AXI-STBus protocol converter is blocking on read transactions
  - This prevents memory controller optimizations
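The FIFO-size effect on this page can be reproduced with a toy cycle model. The sketch reads the "1" and "2" on the slide as input FIFO depths. It also assumes a grant that is registered, so it reflects the FIFO occupancy at the end of the previous cycle, an initiator that always has a request pending, and a slave that drains one entry per cycle. These are simplifying assumptions, not the actual STBus node microarchitecture, and `transfers_accepted` is a hypothetical helper.

```python
def transfers_accepted(fifo_depth, cycles):
    """Count transfers pushed into the input FIFO by an always-ready initiator."""
    occupancy = 0
    grant = True   # registered grant: computed from last cycle's occupancy
    accepted = 0
    for _ in range(cycles):
        pop = occupancy > 0    # slave drains one entry per cycle when available
        push = grant           # initiator pushes whenever it sees a grant
        occupancy += int(push) - int(pop)
        accepted += int(push)
        # The grant seen in the next cycle depends on the occupancy now.
        grant = occupancy < fifo_depth
    return accepted


depth1 = transfers_accepted(fifo_depth=1, cycles=100)
depth2 = transfers_accepted(fifo_depth=2, cycles=100)
```

In this model a depth of 1 loses every other cycle to grant propagation, while a depth of 2 lets the next transfer start back-to-back.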

8 Collapsed AXI platforms

[Figure: platform block diagram (IP 1, LX, IP 2, IP 4, IP 5)]
[Figure: overall execution time, normalized (0 to 3); platforms: ST, AXI1, AXI2, AXI3, AXI_ramo3_su_nodo, AXI_tutti_su_nodo, ST_collassato]

- STBus leverages proprietary bridges
- AHB suffers from its non-split architecture and single outstanding transaction
- AXI shows poor performance with a centralized slave bottleneck
- AXI reduced platforms slightly improve performance
  - Bridge performance is no longer critical
  - Best scenario (heavy load) for AXI
  - However, AXI-STBus conversion is still critical (blocking on reads)

9 Statistics - STBus

[Figure: STBus platform, percentage of time (0% to 50%) in periods 1 and 2; categories: Full, req0/grant1, Empty]

First period
- 47% full, 53% non-blocking (29% no requests, 24% accepting requests)
- FIFO almost never empty (2% out of 29%)
- Conclusion: intensive memory traffic

Second period
- 47% full, 53% non-blocking (38% no requests, 15% accepting requests)
- FIFO often completely empty (23% out of 38%)
- Conclusion: bursty traffic, lower than period 1 on average

Removing AXI limitations

[Figure: AMBA platforms (AHB, mixed AHB-AXI, AXI); labels: protocol converter, flow bottleneck, optimizations]

- Let us replace ProtConv+ with a fast on-chip shared memory

[Figure: all platforms (AHB, mixed AHB-AXI, AXI, STBus); labels: FIFO, native bus IF, shared memory]
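A breakdown like the full / no requests / accepting requests figures for the two periods could be derived from a per-cycle trace roughly as follows. The category definitions are one reading of the slide, with "empty" counted as a subset of "no requests". The function name `fifo_breakdown` is hypothetical, and the trace is synthetic, built only to reproduce the first-period percentages. It is not measured data.

```python
from collections import Counter


def fifo_breakdown(samples, depth):
    """samples: list of (occupancy, request_pending) pairs, one per cycle.

    Returns percentages of cycles per category. Raises ZeroDivisionError
    on an empty trace.
    """
    counts = Counter()
    for occupancy, request in samples:
        if occupancy >= depth:
            counts["full"] += 1          # blocking: FIFO cannot accept
        elif request:
            counts["accepting"] += 1     # non-blocking, request accepted
        else:
            counts["no_request"] += 1    # non-blocking, nothing to accept
            if occupancy == 0:
                counts["empty"] += 1     # subset of no_request
    total = counts["full"] + counts["accepting"] + counts["no_request"]
    keys = ("full", "accepting", "no_request", "empty")
    return {k: 100.0 * counts[k] / total for k in keys}


# Synthetic 100-cycle trace for a depth-4 FIFO (assumed depth).
trace = ([(4, True)] * 47 + [(2, True)] * 24
         + [(1, False)] * 27 + [(0, False)] * 2)
stats = fifo_breakdown(trace, depth=4)
```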

10 Platform performance

[Figure: overall execution time, normalized (0 to 1.6); platforms: ST_shared, AXI2, AXI_ramo3_su_nodo, ST_Shared_fifo_lmi, ST_coll_sha, ST_coll_sha_fifo; annotations: MOTs, Prot. ineff., Fifo 1:1, Fifo 16:16, Best platforms]

- Collapsed AXI has no bridge/converter overhead and benefits from the faster memory
- Message-based arbitration in the STBus central node
  - Same improvement by adding slave FIFOs

Conclusions

[Figure: N LX cores connected through the communication architecture to memories Mem1 ... MemN; N initiators connected through the communication architecture to a single slave]

Many-to-many traffic pattern (single layer architecture):
- AXI/STBus competition depends on % of bus utilization
- AXI trades transaction latency for better scalability under heavy loads
- AXI can allocate internal data lanes at a finer granularity than STBus
- STBus under heavy loads can leverage crossbar instantiations

Many-to-one traffic pattern (single layer architecture):
- The maximum transfer efficiency is imposed by the slave
  - 1 ws SHA MEM: max. efficiency 50%
  - Mem. controller with optimizations: need to keep the IN FIFO full
- The bus's job is to sustain that max efficiency
  - AHB: pipelining control and data (OK for SHA, not OK for ...)
  - STBus: buffering = 2 for SHA, > 2 for...
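The 50% figure for the 1 wait state shared memory follows from simple cycle counting. If each single-word transfer occupies one data cycle plus its wait states, and the slave does not overlap transfers, at most one cycle in every (1 + wait states) carries data. The sketch below states that assumption in code. The function name is hypothetical.

```python
def max_transfer_efficiency(wait_states):
    """Upper bound on the fraction of cycles carrying data.

    Assumes single-word transfers on a slave that does not overlap them:
    each transfer costs 1 data cycle plus `wait_states` idle cycles.
    """
    return 1.0 / (1 + wait_states)


one_ws = max_transfer_efficiency(1)   # the 1 ws shared memory case
two_ws = max_transfer_efficiency(2)   # e.g. a 2 wait states memory
```

No bus can exceed this bound. The protocol comparison is therefore about which bus can sustain it.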

11 Conclusions

[Figure: platform block diagram (IP 1, LX, IP 2, IP 4, IP 5)]

Centralized high latency slave bottleneck (multi-layer architecture):
- All you can require from a bus: distributed buffering, multiple outstanding transactions and a split bus
  - Larger initiator-perceived bandwidth
  - Hides bus topology (and multi-layer latency)
- With a faster on-chip memory:
  - The buffer chain from initiator to target does not fill up
  - Performance is affected by multi-layer latency
- Other bus features are less critical, therefore bus differentiation is very difficult with this platform template

Hints for future work
- Bridges relieve the lack of bus scalability...
  - ...but introduce large complexity
  - Why not use bridge-free multi-hop solutions (Networks-on-Chip)?
- Optimize the I/O system so as to benefit from the specific bus features
  - Higher bandwidth memory controller
  - Multiple I/O ports
  - On-chip shadowing shared memory(ies)
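A back-of-the-envelope bound shows why multiple outstanding transactions raise the initiator-perceived bandwidth and hide multi-layer latency. With N transactions in flight and a round trip of L cycles, an initiator can complete at most N / L transactions per cycle, and never more than the slave can serve. The model and all numbers below are illustrative assumptions, not values from the slides.

```python
def perceived_throughput(outstanding, round_trip_cycles, service_cycles):
    """Upper bound on transactions per cycle seen by one initiator.

    outstanding:        max transactions in flight (assumed protocol limit)
    round_trip_cycles:  request-to-response latency across all bus layers
    service_cycles:     cycles the slave needs per transaction
    """
    latency_bound = outstanding / round_trip_cycles  # limited by in-flight window
    slave_bound = 1.0 / service_cycles               # limited by the slave itself
    return min(latency_bound, slave_bound)


# Hypothetical platform: 20-cycle round trip, slave serves one transaction per 2 cycles.
single = perceived_throughput(1, 20, 2)   # latency fully exposed
four = perceived_throughput(4, 20, 2)     # partly hidden
ten = perceived_throughput(10, 20, 2)     # slave-limited: topology is hidden
```

Once the window is deep enough to reach the slave bound, the extra latency of the bus layers no longer shows up in throughput. This is one way to read "hides bus topology".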

12 Memory controller modelling

[Figure: interconnect; bus slave IF (bus dependent); memory controller (bus independent); SDRAM (SDR SDRAM, DDR SDRAM, DDR2 SDRAM)]

- Should enable interfacing with many bus protocols
- Memory controller optimizations
- Which interface architecture to the bus?
  - Multi-port controller with arbitration on input ports
  - DMA-capable controller
- Which memory controller optimizations?
  - Transaction merging
  - Variable-depth lookahead
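The slide names transaction merging without defining it. One common reading is that queued requests of the same kind to contiguous addresses are fused into a single longer burst before being issued to the SDRAM. The sketch below implements that reading. The `Txn` type, the `merge_transactions` function and the 64-byte burst limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    is_write: bool
    addr: int     # byte address
    length: int   # bytes


def merge_transactions(queue, max_bytes=64):
    """Fuse neighbouring queue entries that form one contiguous access.

    Only adjacent entries are merged, so the original request order is kept.
    """
    merged = []
    for txn in queue:
        last = merged[-1] if merged else None
        if (last is not None
                and last.is_write == txn.is_write              # same direction
                and last.addr + last.length == txn.addr        # contiguous
                and last.length + txn.length <= max_bytes):    # fits one burst
            merged[-1] = Txn(last.is_write, last.addr, last.length + txn.length)
        else:
            merged.append(Txn(txn.is_write, txn.addr, txn.length))
    return merged


queue = [Txn(False, 0x100, 16), Txn(False, 0x110, 16), Txn(True, 0x120, 16),
         Txn(False, 0x200, 32), Txn(False, 0x220, 32), Txn(False, 0x240, 32)]
bursts = merge_transactions(queue)
```

A blocking protocol converter feeds the controller one request at a time, which leaves nothing in the queue to merge. This is consistent with the earlier remark that it prevents memory controller optimizations.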

