Part IV: 3D WiNoC Architectures

Similar documents
3D WiNoC Architectures

A Building Block 3D System with Inductive-Coupling Through Chip Interfaces Hiroki Matsutani Keio University, Japan

Low-Latency Wireless 3D NoCs via Randomized Shortcut Chips

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

A Case for Wireless 3D NoCs for CMPs

Network on Chip Architecture: An Overview

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Low-Power Interconnection Networks

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

MUCCRA-CUBE: A 3D DYNAMICALLY RECONFIGURABLE PROCESSOR WITH INDUCTIVE-COUPLING LINK S. Saito, Y. Kohama, Y. Sugimori, Y. Hasegawa, H.

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1

Network on Chip Architectures BY JAGAN MURALIDHARAN NIRAJ VASUDEVAN

Package level Interconnect Options

NoC Test-Chip Project: Working Document

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

ECE/CS 757: Advanced Computer Architecture II Interconnects

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Prediction Router: Yet another low-latency on-chip router architecture

INTERCONNECTION NETWORKS LECTURE 4

Hybrid On-chip Data Networks. Gilbert Hendry Keren Bergman. Lightwave Research Lab. Columbia University

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Network-on-chip (NOC) Topologies

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom

Network-on-Chip Architecture

4. Networks. in parallel computers. Advances in Computer Architecture

Packet Switch Architecture

Packet Switch Architecture

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

RHiNET-3/SW: an 80-Gbit/s high-speed network switch for distributed parallel computing

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape

PERFORMANCE EVALUATION OF WIRELESS NETWORKS ON CHIP JYUN-LYANG CHANG

Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

Design and Test Solutions for Networks-on-Chip. Jin-Ho Ahn Hoseo University

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Lecture 22: Router Design

Fast Flexible FPGA-Tuned Networks-on-Chip

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

Escape Path based Irregular Network-on-chip Simulation Framework

COMPARATIVE PERFORMANCE EVALUATION OF WIRELESS AND OPTICAL NOC ARCHITECTURES

udirec: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

on Chip Architectures for Multi Core Systems

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

Topologies. Maurizio Palesi. Maurizio Palesi 1

Interconnection Networks

Brief Background in Fiber Optics

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

OVERVIEW: NETWORK ON CHIP 3D ARCHITECTURE

Embedded Systems: Hardware Components (part II) Todor Stefanov

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Multi-level Fault Tolerance in 2D and 3D Networks-on-Chip

Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Real-Time Dynamic Voltage Hopping on MPSoCs

Combining Arm & RISC-V in Heterogeneous Designs

Physical Design Implementation for 3D IC Methodology and Tools. Dave Noice Vassilios Gerousis

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

Ultra-Fast NoC Emulation on a Single FPGA

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Mapping and Physical Planning of Networks-on-Chip Architectures with Quality-of-Service Guarantees

Hubs, Bridges, and Switches (oh my) Hubs

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

OVERCOMING THE MEMORY WALL FINAL REPORT. By Jennifer Inouye Paul Molloy Matt Wisler

Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture

Tightly-Coupled Multi-Layer Topologies for 3-D NoCs

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

The Design and Implementation of a Low-Latency On-Chip Network

Routerless Networks-on-Chip

Calibrating Achievable Design GSRC Annual Review June 9, 2002

Maximizing heterogeneous system performance with ARM interconnect and CCIX

SoC Design Lecture 13: NoC (Network-on-Chip) Department of Computer Engineering Sharif University of Technology

Index 283. F Fault model, 121 FDMA. See Frequency-division multipleaccess

Special Course on Computer Architecture

Chapter 0 Introduction

Natalie Enright Jerger, Jason Anderson, University of Toronto November 5, 2010

STG-NoC: A Tool for Generating Energy Optimized Custom Built NoC Topology

Hardware Design, Synthesis, and Verification of a Multicore Communications API

Non-contact Test at Advanced Process Nodes

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Networks. Distributed Systems. Philipp Kupferschmied. Universität Karlsruhe, System Architecture Group. May 6th, 2009

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Understanding the Routing Requirements for FPGA Array Computing Platform. Hayden So EE228a Project Presentation Dec 2 nd, 2003

1. IEEE and ZigBee working model.

Development of Dependable Wireless System and Device

The Tofu Interconnect D

WP-PD Wirepas Mesh Overview

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock

1. NoCs: What s the point?

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

Transcription:

Wireless NoC as Interconnection Backbone for Multicore Chips: Promises, Challenges, and Recent Developments Part IV: 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan 1

Outline: 3D WiNoC Architectures So far we focused on 2D WiNoC architecture and its physical link design. This part explores 3D WiNoC architectures, especially inductive-coupling 3D option. 3D IC technologies: Wired vs. Wireless [5min] Prototype systems: Cube-0 & Cube-1 [15min] Wireless 3D NoC architectures [15min] Ring-based 3D WiNoC Irregular 3D WiNoC Experiment results and Summary [10min] 2

Design cost of LSI is increasing System-on-Chip (SoC) Required components are integrated on a single chip Different LSI must be developed for each application System-in-Package (SiP) or 3D IC Required components are stacked for each application By changing the chips in a package, we can provide a wider range of chip family with modest design cost Mar Next 24th, 2014 slides show Hiroki Matsutani, techniques "3D WiNoC Architectures", for stacking Tutorial at DATE'14 multiple chips 3

3D IC technology for going vertical Wired Wireless Two chips (face-to-face) Microbump Flexibility Capacitive coupling More than three chips Scalability Through silicon via Inductive coupling 4

Inductive coupling link for 3D ICs More than 3 chips Stacking after chip fabrication Only know-good-dies selected Bonding wires for power supply We have developed some prototype systems of wireless 3D ICs using the inductive coupling Inductor for transceiver Implemented as a square coil with metal in common CMOS Footprint of inductor Not a serious problem. Only metal layers are occupied Note: This part focuses on inter-chip wireless, not the intra-chip wireless introduced in Parts II and III. 5

Outline: 3D WiNoC Architectures So far we focused on 2D WiNoC architecture and its physical link design. This part explores 3D WiNoC architectures, especially inductive-coupling 3D option. 3D IC technologies: Wired vs. Wireless [5min] Prototype systems: Cube-0 & Cube-1 [15min] Wireless 3D NoC architectures [15min] Ring-based 3D WiNoC Irregular 3D WiNoC Experiment results and Summary [10min] 6

An example: MuCCRA-Cube (2008) 4 MuCCRA chips are stacked on a PCB board Data Memory Inductive-Coupling PE PE PE PE Down Link PE PE PE PE PE PE PE PE 5.0mm PE PE PE PE Technology: 90nm Chip thickness: 85um, Glue: 10um Inductive-Coupling Up Link 2.5mm [Saito,FPL 09] Mar 24th, 2014 7

Stacking method: Staircase stacking Inductor has TX/RX/Idle modes (mode change 1-cycle) Pillar TX Slide & stack TX Bonding wire TX TX Bonding wire TX TX Bonding wire TX Inductor (TX) Inductor (RX) TX TX 8

Stacking method: Staircase stacking Inductive-coupling link Pillar Local clock @ 4GHz Serial data TX TX System clock for NoC: 200MHz 35-bit transfer for each clock TX TxData TxClk We have fabricated some prototype multi-core TX TX systems using this wireless technology Inductor (TX) Inductor (RX) TX TxData TxClk Local clock shared by neighboring chips; No global sync. 9 TX

An example: Cube-0 (2010) Test chip for vertical communication schemes Vertical point-to-point link between adjacent chips Vertical shared bus (broadcast) Each chip has 2 cores (packet counter) 2 routers [Matsutani, NOCS 11] Inductors (P2P ring) Inductors (vertical bus) Process: Fujitsu 65nm (CS202SZ) Voltage: 1.2V System clock: 200MHz 2.1mm x 2.1mm Inductors (P2P) Core 0 & 1 Router 0 & 1 Inductors (bus) 10

An example: Cube-0 (2010) Test chip for vertical communication schemes TX Vertical point-to-point link between adjacent chips Vertical shared bus (broadcast) [Matsutani, NOCS 11] RX Stacking for Ring network 2.1mm x 2.1mm Inductors (P2P) Core 0 & 1 Slide & stack Router 0 & 1 Inductors (bus) 11

An example: Cube-0 (2010) Test chip for vertical communication schemes TX Vertical point-to-point link between adjacent chips Vertical shared bus (broadcast) [Matsutani, NOCS 11] RX Stacking for Ring network 2.1mm x 2.1mm Inductors (P2P) Core 0 & 1 TX/RX Stacking for Vertical bus Router 0 & 1 Inductors (bus) 12

An example: Cube-1 (2012) Test chips for building-block 3D systems Two chip types: Host CPU chip & Accelerator chip We can customize number & types of chips in SiP Cube-1 Host CPU chip Two 3D wireless routers MIPS-like CPU [Miura, IEEE Micro 13] Cube-1 Accelerator chip Two 3D wireless routers Processing element array MIP CPU Core 8x8 PE Array Inductor 13

An example: Cube-1 (2012) Microphotographs of test chips [Miura, IEEE Micro 13] Host CPU Chip Accelerator Chip Host CPU + 3 Accelerators 14

An example: Cube-1 (2012) Block diagram of CPU & Accelerator chips 15

An example: Cube-1 (2012) Inductive-coupling ThruChip Interface (TCI) Note: Please refer to Part III for antenna design for on-chip wireless. 16

An example: Cube-1 (2012) 17

An example: Cube-1 (2012) Cube-1 Cube-1 demo system PE array chip performs image processing CPU chip for control [Miura, HotChips 13 Demo] Motherboard 18

Outline: 3D WiNoC Architectures So far we focused on 2D WiNoC architecture and its physical link design. This part explores 3D WiNoC architectures, especially inductive-coupling 3D option. 3D IC technologies: Wired vs. Wireless [5min] Prototype systems: Cube-0 & Cube-1 [15min] Wireless 3D NoC architectures [15min] Ring-based 3D WiNoC Irregular 3D WiNoC Experiment results and Summary [10min] 19

Big picture: Wireless 3D NoC Arbitrary chips are stacked after fabrication Each chip has vertical links at pre-specified locations, but we do not know internal topology of each chip Wireless 3D NoC required to stack unknown topologies Memory chip from A GPU chip from B Note: We can add long-range links to induce small-world effects [See Part I] Required chips are stacked for each application CPU chip from C An example (4 chips) 20

Two approaches: Wireless 3D NoC arch Chips should be added, removed, swapped for each app. Ring-based approach Good Bad Bad Easy to add & remove Inefficient hop count No scalability Chip 4 Chip 3 [Matsutani, NOCS 11] Irregular approach We can use any links Irregular routing needed Plug-and-play protocol Chip 4 Chip 3 [Matsutani, ASPDAC 13] Chip 2 Chip 1 Chip 2 Chip 1 Chip 0 Chip 0 21

Ring-based 3D wireless NoC Chips are connected via unidirectional rings Pillar TX TX TX TX TX TX TX TX RX Router to horizontal NoC TX TX 22

Ring approach: Deadlock problem Ring inherently includes a cyclic dependency Pillar TX TX RX Router to horizontal NoC Buffer 23

Ring approach: Deadlock problem Ring inherently includes a cyclic dependency Pillar TX TX RX Router to horizontal NoC Buffer 24

Ring approach: Deadlock problem Ring inherently includes a cyclic dependency Pillar Cannot move Cannot move TX TX RX Router to horizontal NoC Buffer Cannot move Any packets cannot advance Deadlock avoidance is needed 25

Ring approach: Deadlock avoidance Adding extra VCs Conventional way Duplicating buffers 2 VCs for each message class Bubble flow control Buffer space of a single packet must be always reserved in each router All message classes share the same buffers [Puente,ICPP 99] Cyclic dependency is formed 26

Ring approach: Deadlock avoidance Adding extra VCs Conventional way Duplicating buffers 2 VCs for each message class Bubble flow control Buffer space of a single packet must be always reserved in each router All message classes share the same buffers Cyclic dependency is cut at the dateline VC0 Two VCs (VC0 and VC1) VC1 [Puente,ICPP 99] Dateline 2 VCs required for a message class; Multi-core uses multiple classes 27

Ring approach: Deadlock avoidance Adding extra VCs Conventional way Duplicating buffers 2 VCs for each message class Bubble flow control Buffer space of a single packet must be always reserved in each router All message classes share the same buffers [Puente,ICPP 99] Deadlock occurs, because all buffers are occupied 28

Ring approach: Deadlock avoidance Adding extra VCs Conventional way Duplicating buffers 2 VCs for each message class Bubble flow control Buffer space of a single packet must be always reserved in each router All message classes share the same buffers [Puente,ICPP 99] Deadlock does not occur unless all buffers are occupied Empty space of a packet We employ Bubble flow for CMP with multiple message classes 29

Outline: 3D WiNoC Architectures So far we focused on 2D WiNoC architecture and its physical link design. This part explores 3D WiNoC architectures, especially inductive-coupling 3D option. 3D IC technologies: Wired vs. Wireless [5min] Prototype systems: Cube-0 & Cube-1 [15min] Wireless 3D NoC architectures [15min] Ring-based 3D WiNoC Irregular 3D WiNoC Experiment results and Summary [10min] 30

Two approaches: Wireless 3D NoC arch Chips should be added, removed, swapped for each app. Ring-based approach Good Bad Bad Easy to add & remove Inefficient hop count No scalability Chip 4 Chip 3 [Matsutani, NOCS 11] Irregular approach Good We can use any links Irregular routing needed Plug-and-play protocol Chip 4 Chip 3 [Matsutani, ASPDAC 13] Chip 2 Chip 1 Chip 2 Chip 1 Chip 0 Chip 0 31

Irregular approach: Ad-hoc topology Wireless 3D CMPs Various chips are stacked, depending on the application Each chip Must have vertical links May not have horizontal links May have VCs for horizontal Chip 7 Chip 2 Ad-hoc wireless 3D NoC We cannot expect the network topology, number of VCs, and its bandwidth before stacking Chip 1 Chip 0 32

Irregular approach: Ad-hoc topology Wireless 3D CMPs Various chips are stacked, depending on the application Each chip Must have vertical links May not have horizontal links May have VCs for horizontal Chip 7 Chip 2 Ad-hoc wireless 3D NoC We cannot expect the network topology, number of VCs, and its bandwidth before stacking Chip 1 Chip 0 33

Irregular approach: Ad-hoc topology Wireless 3D CMPs Various chips are stacked, depending on the application Each chip Must have vertical links May not have horizontal links May have VCs for horizontal Chip 7 Chip 2 No horizontal link Ad-hoc wireless 3D NoC We cannot expect the network topology, number of VCs, and its bandwidth before stacking Chip 1 Chip 0 34

Irregular approach: Ad-hoc topology Wireless 3D CMPs Various chips are stacked, depending on the application Each chip Must have vertical links May not have horizontal links May have VCs for horizontal Chip 7 Extreme case: only the bottom has link Chip 2 Ad-hoc wireless 3D NoC We cannot expect the network topology, number of VCs, and its bandwidth before stacking Chip 1 Chip 0 We need a mechanism to route packets even with such cases 35

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down [Schroeder, JSAC 91] Note: Please refer to Part II for routing strategy for irregular WiNoCs. Root 0 1 2 3 An example 4x4 2D mesh 4 5 6 7 A root node is selected 8 9 10 11 12 13 14 15 36

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down Root [Schroeder, JSAC 91] Up direction 0 1 2 3 An example 4x4 2D mesh 4 5 6 7 Direction (up or down) is determined 8 9 10 11 12 13 14 15 37

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down An example Root 4x4 2D mesh Routing path is generated Down-up turn is prohibited It generates imbalanced paths [Schroeder, JSAC 91] 0 1 2 3 OK Up direction 4 5 6 7 8 9 10 11 NG 12 13 14 15 38

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down [Schroeder, JSAC 91] Chip 3 Root 0 1 Another example 3D NoC with 4 chips Chip 2 Chip 1 2 3 4 5 Chip 0 6 7 39

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down [Schroeder, JSAC 91] Chip 3 Root 0 1 Another example 3D NoC with 4 chips Chip 2 Chip 1 2 3 4 5 Chip 0 6 7 40

Irregular approach: Up*/down* routing Up*/down* (UD) routing Irregular network routing A root node is selected Packets go up and then go down [Schroeder, JSAC 91] Chip 3 Root 0 1 Another example 3D NoC with 4 chips Chip 2 Chip 1 OK 2 3 4 5 NG Chip 0 The best spanning tree root is selected by exhaustive or heuristic using communication traces (9sec for 64-tile) 6 7 41

Irregular approach: UD with VCs UD routing with multiple VCs Chip 3 Each layer (VC) has its own spanning tree Packets can transit multiple layers in descent order 0 1 Root [Koibuchi,ICPP 03] [Lysne,TPDS 06] 0 1 Chip 3 Chip 2 2 3 You can use either VC0 or VC1 OK 2 3 Chip 2 Chip 1 4 5 OK 4 5 Chip 1 Chip 0 6 Root 7 6 7 Chip 0 How to recognize the topology VC1 & build VC0 multiple spanning trees? 42

Outline: 3D WiNoC Architectures So far we focused on 2D WiNoC architecture and its physical link design. This part explores 3D WiNoC architectures, especially inductive-coupling 3D option. 3D IC technologies: Wired vs. Wireless [5min] Prototype systems: Cube-0 & Cube-1 [15min] Wireless 3D NoC architectures [15min] Ring-based 3D WiNoC Irregular 3D WiNoC Experiment results and Summary [10min] 43

Full-system CMP simulations Application performance of two approaches is evaluated Ring-based approach Good Bad Bad Easy to add & remove Inefficient hop count No scalability Chip 4 Chip 3 [Matsutani, NOCS 11] Irregular approach Good We can use any links Irregular routing needed Plug-and-play protocol Chip 4 Chip 3 Chip 2 Chip 1 Chip 2 Chip 1 Chip 0 Chip 0 44

Network topology: Irregular The following iteration is performed 1,000 times Each tile has router and core (e.g., processor or caches) Each horizontal link appears with 50% We examined three cases: 16, 32, and 64 tiles 16-tile (2,2,4) 32-tile (4,2,4) 64-tile (4,4,4) 2x2 mesh * 4chips 4x2 mesh * 4chips 4x4 mesh * 4chips 45

Network topology: Irregular The following iteration is performed 1,000 times Each tile has router and core (e.g., processor or caches) Among Each 1,000 horizontal random link topologies, appears with one with 50% the most typical hop count value is selected for the full-system evaluation We examined three cases: 16, 32, and 64 tiles 16-tile (2,2,4) 32-tile (4,2,4) 64-tile (4,4,4) 2x2 mesh * 4chips 4x2 mesh * 4chips 4x4 mesh * 4chips 46

Parallel programs are running on it Ring-based approach Good Bad Bad GEMS/Simics is used for full-system simulations Easy to add & remove Inefficient hop count No scalability Chip 4 Chip 3 [Matsutani, NOCS 11] Irregular approach Good We can use any links Irregular routing needed Plug-and-play protocol Chip 4 Chip 3 [Matsutani, ASPDAC 13] Chip 2 Chip 1 Chip 2 Chip 1 Chip 0 Chip 0 47

Parallel programs are running on it GEMS/Simics is used for full-system simulations Solaris 9 is running on 8-core UltraSPARC Table 1: Topologies to be examined Routers CPUs L2$banks MCs 16-tile 16 4 32 4 32-tile 32 8 64 8 64-tile 64 8 128 16 Table 2: Simulation parameters L1$ size & latency 64K / 1cycle L2$ size & latency 256K / 6cycle Memory size & latency 4G / 160cycle Router latency [RC/VSA] [ST] [LT] Router buffer size 5-flit per VC Protocol MOESI directory Table 3: Application programs NPB (IS, DC, CG, MG, EP, LU, UA, SP, BT, FT) 48

Application exec time: 16-tile Ring-based approach (VC flow & Bubble flow controls) Irregular approach Irregular approach outperforms Ring-based one by 10.8% in 16-tile case. Mar 24th, 2014 Hiroki Matsutani, "3D WiNoC Architectures", Tutorial at DATE'14 49

Application exec time: 64-tile Ring-based approach (VC flow & Bubble flow controls) Irregular approach Irregular approach outperforms Ring-based one by 46.0% in 64-tile case. Ring has no scalability Irregular one improves significantly 51

Application exec time: 16-tile Irregular (50% of horizontal links are implemented) 3D mesh (all horizontal links are implemented) Performance of Irregular approach Irr3(min) is closed to that of 3D mesh 52

Application exec time: 64-tile Irregular (50% of horizontal links are implemented) 3D mesh (all horizontal links are implemented) Performance of Irregular approach Irr3(min) is closed to that of 3D mesh Mar Optimized 24th, 2014 Hiroki Irr3(min) Matsutani, "3D improves WiNoC Architectures", by 15.1% Tutorial at compared DATE'14 to the worst 54

Experiment results: Cube-1 (2012) Cube-1 Cube-1 demo system PE array chip performs image processing CPU chip for control [Miura, HotChips 13 Demo] Motherboard 55

Experiment results: Cube-1 (2012) 56

Experiment results: Cube-1 (2012) [Miura, IEEE Micro 13] Packet error rate: Error-free operation at nominal supply voltage Power consumption: 5.8mW per 2Gbps channel (at 0.92V) 57

Summary: 3D WiNoC Architectures Inductive-coupling 3D SiP A low cost alternative to build low-volume custom systems by stacking off-the-shelf known-good-dies No special process technology is required; inductors are implemented with metal layers Cube-1: A practical 3D WiNoC system Two types: Host CPU chip & Accelerator chips We can customize number & types of chips in SiP 58

Future plans: 3D WiNoC Architectures Cube-2 design: STM 28nm process CPU & Accelerator chips Memcached accelerator chip for smart sensors All-in-one TX/RX macro: Coil + Routers/buffers Coil uses only metal layers; silicon available Power/heat management: Dynamic on/off control of inductors Closed-loop control [Elfadel, DATE 13] (See Part I) Combine 2D & 3D WiNoC: mm-wave wireless for intra-chip (See Parts II and III) Steady Challenging 59

Future plans: 3D WiNoC Architectures Jumbo inductors: Jumbo inductor like a power ring <1mm communication Relax SiP assembly Wireless broadcast bus: Combine wireless vertical bus & P2P links Static/dynamic TDMA vs. CSMA/CD Cartridge style computer: Insert necessary chips Power/clk from cartridge Inter-chip data transfers use wireless Power from wireless?? Exploiting small-world: Add a random NoC chip to 3D WiNoC SiP to shorten path length (See Part I) Steady Challenging 60

References (1/2) Cube-0: The first real 3D WiNoC H. Matsutani, et.al., "A Vertical Bubble Flow Network using Inductive-Coupling for 3-D CMPs", NOCS 2011. Y. Take, et.al., "3D NoC with Inductive-Coupling Links for Building-Block SiPs", IEEE Trans on Computers (2014). Cube-1: The heterogeneous 3D WiNoC N. Miura, et.al., "A Scalable 3D Heterogeneous Multicore with an Inductive ThruChip Interface", IEEE Micro (2013). MuCCRA-Cube: Dynamically reconfigurable processor S. Saito, et.al., "MuCCRA-Cube: a 3D Dynamically Reconfigurable Processor with Inductive-Coupling Link", FPL 2009. 61

References (2/2) Vertical bubble flow control on Cube-0 H. Matsutani, et.al., "A Vertical Bubble Flow Network using Inductive-Coupling for 3-D CMPs", NOCS 2011. Y. Take, et.al., "3D NoC with Inductive-Coupling Links for Building-Block SiPs", IEEE Trans on Computers (2014). Spanning trees optimization for 3D WiNoCs H. Matsutani, et.al., "A Case for Wireless 3D NoCs for CMPs", ASP-DAC 2013 (Best Paper Award). 62