Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

1 Implementing Flexible Interconnect for Machine Learning Acceleration. Arm Tech Symposia, Oct. William Tseng

2 Machine Learning / AI SoC: New Challenges. Large chips, huge bandwidth. [Figure: 20 mm x 20 mm die with eight memory controllers.] 31 October 2018 Copyright Arteris IP

3 Degree. The number of input and output links at each node: ring = 2, mesh = 2 to 4, torus = 4. Degree is a proxy for the network's complexity or cost: a higher degree means more complexity at each node, which means larger switches and more area and energy overhead. Credit: On-Chip Networks, 2nd edition.

4 Bisection Bandwidth. The bandwidth across a cut that splits the network into two equal parts: ring = 2, mesh = 3, torus = 6. It serves as a proxy for cost in terms of the global wiring needed to implement the network. Credit: On-Chip Networks, 2nd edition.

5 Diameter. The maximum distance between any two nodes of the topology, measured in hops (routers that must be traversed): ring = 4, mesh = 4, torus = 2. It serves as a proxy for maximum latency (not counting wire delay). Credit: On-Chip Networks, 2nd edition.
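
The degree and diameter figures quoted on the slides can be checked with a short script. This is a minimal sketch, assuming the slide values refer to a 9-node ring, a 3x3 mesh, and a 3x3 torus (the slides do not state the node counts); the helper names `ring`, `grid`, `degrees`, and `diameter` are illustrative only. Bisection bandwidth is left out because it depends on how the cut is drawn.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency sets for an n-node ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(k, wrap):
    """Adjacency sets for a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {node: set() for node in product(range(k), repeat=2)}
    for x, y in adj:
        for dx, dy in ((1, 0), (0, 1)):
            nx, ny = x + dx, y + dy
            if wrap:
                nx, ny = nx % k, ny % k  # wraparound links exist only in a torus
            if nx < k and ny < k:
                adj[(x, y)].add((nx, ny))
                adj[(nx, ny)].add((x, y))
    return adj


def degrees(adj):
    """(min, max) number of links per node."""
    sizes = [len(neighbours) for neighbours in adj.values()]
    return min(sizes), max(sizes)


def diameter(adj):
    """Longest shortest path in hops, found by BFS from every node."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(s) for s in adj)


if __name__ == "__main__":
    for name, adj in (("ring", ring(9)),
                      ("mesh", grid(3, wrap=False)),
                      ("torus", grid(3, wrap=True))):
        print(name, "degree (min, max):", degrees(adj), "diameter:", diameter(adj))
```

Under these assumptions the script reproduces the degree and diameter values listed on slides 3 and 5.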

6 vs custom. Figure annotations: 1) path not needed; 2) double-bandwidth input needed; 3) path not needed; 4) unnecessary routers.

7 Case Study 1. MIT: Eyeriss
 - 3 separate NoCs:
   - Global input network (global buffer (GB) to processing element (PE): filter, ifmap, or psum)
   - Global output network (reading psums from the PEs and sending them back to the GB)
   - Local network (transferring psums between the PEs)
 - Combined, it is essentially a 14x12 mesh
 - X links are used to move data from the GB to the PEs
 - Y links are used to move psums between PEs

8 Case Study 2. KAIST Multicast NoC
 - Object recognition in mobile robots
 - Mix of fat tree, ring, and multicasting
 - No virtual channels (VCs) used; wormhole routing used

9 FlexNoC 4 with AI Package

10 Interconnect IP is the Data Highway of the SoC. [Figure: SoC block diagram. A CPU subsystem (A57 and A53 clusters with L2 caches), GPU, DSP, application IP, wireless, security, memory, and I/O peripheral subsystems are connected through FlexWay subsystem interconnects, the FlexNoC non-coherent interconnect, and the Ncore cache coherent interconnect, with a CodaCache LLC and memory scheduler in front of the memory controllers.] Legend: Arteris IP FlexNoC non-coherent interconnect IP; Arteris IP Ncore cache coherent interconnect IP; Arteris IP last level cache(s).

11 Active Arteris IP Customers (Public). [Logo slide grouped by market segment: Automotive, Machine Learning/AI, SSD, Networking & Automation, IoT, Consumer Electronics & ASIC, Mobility.] Customers shown: NXP, Toshiba, NXP, Major SSD Vendor, Very Large SoC Maker, Major System OEM, Major SSD Vendor, Major FPGA Company, Japan System OEM, Japan System OEM, Automotive SoC Maker, Major Auto Tier-1, Major FPGA Company, Major ADAS System Maker, Major IP Provider, Toshiba, Large Drone Maker, Major Industrial OEM, Defense Contractor, Major IP Provider, Very Large SoC Maker, Major Semi Fab, Defense Contractor, Networking Vendor, Research Institute, Server CPU Vendor, Machine Learning SoC Company, Japan Camera OEM, Major Design Services Co., Major Design Services Co., Major SSD Vendor, AI SoC Vendor. *Logos and customer names include only publicly announced Arteris IP users as of 8 October.

12 FlexNoC 4 AI Technology: New Features!
 - Mesh, ring, torus (predictable data flow, homogeneous accelerators): topology generation & automation; customize and optimize; flexible router architecture
 - Large chips (long cross-chip paths, timing closure problems): source synchronous communications; VC-Links (virtual channels)
 - Huge bandwidth (on-chip data flow, access to off-chip memory): multicast; multi-channel HBM2 memory; high-bandwidth datapaths

13 Generate and automate topology. Meshes and tori are defined as a grid with a default node size. The grid can be edited for processing nodes of different sizes and aspect ratios.

14 Display and edit topologies. Automatically generated topologies can be edited programmatically or in the GUI. A physically-aware topology display engine displays NoC elements on top of the layout. This gives the architect visibility into what the automation created and control to edit and optimize the generated topology.

15 Optimize FlexNoC-generated routers. Routers are generated from the FlexNoC 4 database and built using switches, FIFOs, and other elements. The architect can edit any of the FlexNoC-generated routers, all at once or individually. [Figure: router with IN and OUT ports on the N, S, E, and W sides plus Local Master and Local Slave ports.]

16 Source Synchronous & VC-Links. Large chips (20 mm x 20 mm in the figure) mean very long distances. Paths need 10 to 20 pipeline levels and are asynchronous: clock tree issues! Channels are narrow, with obstructions: routing congestion!

17 Source synchronous communications: distance spanning while easing clock tree synthesis. A source synchronous pipeline provides the distance spanning, so the PD2 clock tree does not need to span long distances. [Figure: switches (SW) in PD1 and PD2 linked by a source synchronous pipeline, with clock PD2_Clk.]

18 VC-Links (Virtual Channel Links)
 - Allow sharing a link in a non-blocking fashion
 - The link can be pipelined (0 to 31 pipes) for distance spanning
 - VC arbitration has bubble prevention mechanisms (favor whole packets) as an option
 - Lightweight VC logic where you need it: same switches as before (small and fast)
 - Help QoS when multiple links need to be merged for wire reduction reasons
[Figure: VC arbiter (VC ARB) on a shared link.]

19 FlexNoC Intelligent Multicast: highly efficient multicast bandwidth and area
 - Often used for DNN weight and image map updates
 - Broadcast station technology optimizes use of NoC bandwidth
 - Broadcast is done as close as possible to the destinations
 - There can be any number of stations in a FlexNoC
 - Writing to a broadcast station makes it send the writes in turn to multiple destinations
 - Supports posted writes for higher performance
[Figure: write WR0 from a master source (CPU, DMA, etc.) passing through broadcast stations BC_S0, BC_S1, and BC_S2 to targets T0 to T3.]
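
The bandwidth argument for replicating writes close to the destinations can be illustrated by counting link traversals. This is a toy model, not FlexNoC's implementation: the tree below (`master`, `sw0`, `bc_s1`, `T0` to `T3`) and the function names are hypothetical, and each link is assumed to carry one copy of the write per traversal.

```python
def path_to_root(parent, node):
    """Links (as parent-child pairs) between a node and the root of the tree."""
    links = []
    while parent[node] is not None:
        links.append((parent[node], node))
        node = parent[node]
    return links


def unicast_traversals(parent, dests):
    """The source sends one separate write per destination along its full path."""
    return sum(len(path_to_root(parent, d)) for d in dests)


def multicast_traversals(parent, dests):
    """One write travels each shared link once and is replicated at the last
    common point, so the cost is the number of distinct links used."""
    used = set()
    for d in dests:
        used.update(path_to_root(parent, d))
    return len(used)


# Hypothetical topology: master -> sw0 -> bc_s1 -> {T0, T1, T2, T3}
parent = {
    "master": None,
    "sw0": "master",
    "bc_s1": "sw0",
    "T0": "bc_s1", "T1": "bc_s1", "T2": "bc_s1", "T3": "bc_s1",
}
targets = ["T0", "T1", "T2", "T3"]

if __name__ == "__main__":
    print("unicast link traversals:  ", unicast_traversals(parent, targets))
    print("multicast link traversals:", multicast_traversals(parent, targets))
```

In this toy tree the shared links from the master to the station carry the write once instead of four times, which is the saving the slide attributes to broadcasting near the destinations.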

20 HBM2 & Multichannel Memory Support
FlexNoC 4 is well suited to implementing the HBM2 front end:
 - 8 or 16 channels interleaving between initiators (I-NIU) and targets (T-NIU)
 - Reorder buffers (RB)
 - Traffic aggregation / data width conversions
 - Up to 1024 bits wide connections
Coming in FlexNoC 4.x: support in the memory scheduler.
[Figure: I-NIUs with reorder buffers (RB) connected to T-NIUs in front of an 8-channel HBM2 controller, with a 1024-bit connection.]
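
Channel interleaving can be sketched as a simple address-to-channel mapping. This is an illustrative assumption, not the FlexNoC or HBM2 mapping: the 256-byte interleave granule, the modulo scheme, and the names `channel_of` and `local_address` are all hypothetical.

```python
from collections import Counter

GRANULE = 256  # bytes per interleave block (assumed value, not from the slides)


def channel_of(addr, n_channels=8, granule=GRANULE):
    """Channel that serves a byte address: consecutive blocks rotate over channels."""
    return (addr // granule) % n_channels


def local_address(addr, n_channels=8, granule=GRANULE):
    """Address seen inside the selected channel once the channel bits are removed."""
    block = addr // granule
    return (block // n_channels) * granule + addr % granule


if __name__ == "__main__":
    # A linear sweep over 64 blocks spreads evenly across the 8 channels.
    sweep = Counter(channel_of(a) for a in range(0, 64 * GRANULE, GRANULE))
    print(dict(sweep))
```

A linear access stream then touches every channel in turn, which is what lets several initiators draw bandwidth from all channels at once; because responses from different channels can return out of order, a mapping like this is usually paired with reorder buffers such as the RB blocks the slide lists.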

21 High Bandwidth Datapath
 - Up to 2048 bits wide datapath
 - Complete buffering capabilities to deal with rate adaptation
 - Rate mismatch adaptation, narrow to wide: 128-bit and 256-bit inputs into 2048 bits through store-and-forward buffers
 - Rate mismatch adaptation, wide to narrow: 2048 bits into 128-bit and 64-bit outputs through FIFOs
 - Tune NoC bandwidth to application and PPA requirements
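
The narrow-to-wide and wide-to-narrow conversions can be sketched as packing and unpacking beats. This is a behavioral illustration only, assuming little-endian lane order (the first narrow beat lands in the least significant bits); the function names are hypothetical and no timing or buffering depth is modeled.

```python
def pack(beats, narrow_bits, wide_bits):
    """Narrow to wide: collect wide_bits // narrow_bits beats into each wide word.
    A word is produced only once all of its beats are present (store and forward)."""
    if wide_bits % narrow_bits:
        raise ValueError("wide width must be a multiple of the narrow width")
    ratio = wide_bits // narrow_bits
    if len(beats) % ratio:
        raise ValueError("incomplete wide word")
    words = []
    for start in range(0, len(beats), ratio):
        word = 0
        for lane, beat in enumerate(beats[start:start + ratio]):
            word |= beat << (lane * narrow_bits)  # assumed little-endian lanes
        words.append(word)
    return words


def unpack(words, wide_bits, narrow_bits):
    """Wide to narrow: split each wide word back into narrow beats."""
    ratio = wide_bits // narrow_bits
    mask = (1 << narrow_bits) - 1
    return [(word >> (lane * narrow_bits)) & mask
            for word in words
            for lane in range(ratio)]


if __name__ == "__main__":
    beats = list(range(1, 33))            # 32 beats of 128 bits each
    words = pack(beats, 128, 2048)        # 16 beats per 2048-bit word
    print(len(words), "wide words from", len(beats), "narrow beats")
```

With 128-bit inputs and a 2048-bit datapath, 16 narrow beats fill one wide word, so the narrow side must run 16 beats for every wide transfer; buffering absorbs that rate mismatch.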

22 Accelerate AI & Machine Learning SoC Development with the new FlexNoC 4 and the AI Package
 - Topology generation & automation; customize and optimize; flexible router architecture
 - Large chips: source synchronous communications; VC-Links (virtual channels)
 - Huge bandwidth: multicast; multi-channel HBM2 memory support; high-bandwidth datapaths
Better, more complex AI systems. Sooner.
