EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

Similar documents
Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle-Switch Networks for Many-Core Systems

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 14: Photonic Interconnect

Quality-of-Service for a High-Radix Switch

Phastlane: A Rapid Transit Optical Routing Network

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Lecture 3: Flow-Control

TDT Appendix E Interconnection Networks

Lecture 2: Topology - I

Interconnection Networks

Low-Power Interconnection Networks

Overlaid Mesh Topology Design and Deadlock Free Routing in Wireless Network-on-Chip. Danella Zhao and Ruizhe Wu Presented by Zhonghai Lu, KTH

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

Lecture 18: Communication Models and Architectures: Interconnection Networks

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

OASIS Network-on-Chip Prototyping on FPGA

4. Networks. in parallel computers. Advances in Computer Architecture

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011

Chapter 9 Multiprocessors

Design and Simulation of Router Using WWF Arbiter and Crossbar

Embedded Systems: Hardware Components (part II) Todor Stefanov

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

Interconnection Networks

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Lecture 7: Flow Control - I

Snoop-Based Multiprocessor Design III: Case Studies

Lecture: Interconnection Networks

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

Quality-of-Service for a High-Radix Switch

ECE 551 System on Chip Design

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 3: Topology - II

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Design and Implementation of Multistage Interconnection Networks for SoC Networks

Network on Chip Architecture: An Overview

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Challenges for Future Interconnection Networks Hot Interconnects Panel August 24, Dennis Abts Sr. Principal Engineer

Interconnection Networks

OASIS NoC Architecture Design in Verilog HDL Technical Report: TR OASIS

Lect. 6: Directory Coherence Protocol

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Brief Background in Fiber Optics

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Lecture 12: SMART NoC

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

Lecture 11: Large Cache Design

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

ECE 571 Advanced Microprocessor-Based Design Lecture 10

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip

SCORPIO: 36-Core Shared Memory Processor

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

On-chip Monitoring Infrastructures and Strategies for Many-core Systems

Performance of coherence protocols

ECE/CS 757: Advanced Computer Architecture II Interconnects

Scalable Cache Coherence

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

FUTURE high-performance computers (HPCs) and data. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

SEMICON Solutions. Bus Structure. Created by: Duong Dang Date: 20 th Oct,2010

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

EE/CSCI 451: Parallel and Distributed Computation

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

International Journal of Research and Innovation in Applied Science (IJRIAS) Volume I, Issue IX, December 2016 ISSN

EE382 Processor Design. Illinois

Lecture 22: Router Design

High Performance and Low Power On-Die Interconnect Fabrics

Buses. Maurizio Palesi. Maurizio Palesi 1

EECS 570 Final Exam - SOLUTIONS Winter 2015

ECE 669 Parallel Computer Architecture

1. NoCs: What s the point?

Packet Switch Architecture

Packet Switch Architecture

Network-on-Chip Architecture

A More Sophisticated Snooping-Based Multi-Processor

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Design of a high-throughput distributed shared-buffer NoC router

CMPE 150/L : Introduction to Computer Networks. Chen Qian Computer Engineering UCSC Baskin Engineering Lecture 11

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Buses. Disks PCI RDRAM RDRAM LAN. Some slides adapted from lecture by David Culler. Pentium 4 Processor. Memory Controller Hub.

EE108B Lecture 17 I/O Buses and Interfacing to CPU. Christos Kozyrakis Stanford University

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs

Hardware Design, Synthesis, and Verification of a Multicore Communications API

IV. PACKET SWITCH ARCHITECTURES

Transcription:

1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1

Announcements Upcoming lecture schedule Today: On-chip Interconnects (Circuits and Photonics) 2/25: Off-chip Interconnects 3/1 & 3/3: Spring Break 3/8: Midterm Review, Structure for rest of lectures, Project Specification Midterm Exam: In-Class March 1 th, closed notes 2 Reading Assignments: For this week: "Swizzle-Switch Networks for Many-Core Systems," Korey Sewell et al. JETCAS IEEE Journal on Emerging and Selected Topics in Circuits and Systems, June 212 "Corona: System implications of emerging nanophotonic technology." Dana Vantrease et al. ISCA 28. 2 2

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 3 3 3

Swizzle Switch 4 Conventional Matrix Crossbar Swizzle Switch Embeds arbitration within crossbar single cycle arbitration Re-use input/output data buses for arbitration SRAM-like layout with priority bits at cross-points Low-power optimizations Excellent scalability 4 4

Data Routing Source IP Source IP 128 1 Pre-charge 1 Multicast & Broadcast Bitlines discharged if Data = 1 Crosspoint = 1 5 Source IP 1 Read Buffers 128 Dest. IP Dest. IP Dest. IP 5 5

Swizzle Switch Architecture 6 Output bus used as prioritybus Source IP Source IP 128 1 Pre-charge 1 Req k Rel k Req l Rel l vector 1 1 1 1 Source IP vectors 1 Req m Rel m 1 1 1 1 1 128 Dest. IP Read Buffers Dest. IP Dest. IP Sense Amp + Latch Req n Rel n 1 1 Data routing, arbitration, And priority update control embedded within crosspoints 6 6

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 7 7 7

Inhibit Based Arbitration Req l Req m Sense Amp + Latch Req n Output bus used as priority bus x y z line l x < y line m 1 z > y 1 1 line n This diagram is a single column in the Swizzle-Switch (output), each output arbitrates/transfers data independently Each Crosspoint has a sense amp/ latch to indicate connectivity. Each input samples a unique bit of the output bus to determine if it has been granted the channel vectors are stored and when a request is issued they discharge bits along the output columns to INHIBIT lower priority requests Finally, the priority vectors are updated when the data transfer completes. 8 8 8

Least Recently Granted(LRG) Output bus used as priority bus 9 Req l x line l x < y line n LOWEST Discharges NO Lines Req m Sense Amp + Latch Req n y z line m 1 z > y 1 1 INTERMEDIATE Discharges SOME Lines HIGHEST Discharges ALL Lines 9 9

Least Recently Granted(LRG) Output bus used as priority bus 1 Req l Req m Sense Amp + Latch Req n x 1 y z line l x < y line m 1 z > y 1 1 line n Example Arbitration: (1) Req l and Req m Request the bus (red lines) (2) Req m discharges line l, priority lines m and n remain charged (green lines) (3) Req l senses line l and is inhibited (not granted), Req m senses line m and is not inhibited (4) The crosspoint records the connectivity at input m 1 1

Least Recently Granted(LRG) Output bus used as priority bus 11 Example Update: Req l Rel l x SET Input m signals it is done with data transfer by asserting Rel m Req m Rel m 1 y 1 RESET Req n Rel n z 1 1 11 11

Least Recently Granted(LRG) Output bus used as priority bus 12 Req l Rel l x + 1 1 INTERMEDIATE Discharges SOME Lines Req m Rel m LOWEST Discharges NO Lines Req n Rel n z 1 1 HIGHEST Discharges ALL Lines 12 12

Least Recently Granted(LRG) Output bus used as priority bus 13 Req l Rel l Req m Rel m x + 1 1 Increasing x y 62 z LRG Algorithm l m 62 n m m releases channel l m-1 62 n Least priority upgraded unchanged Req n Rel n z 1 1 13 13

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 14 14 14

64x64 Prototype 15 15 15

Measurement Results 16 16 16

Measurement Results 17 17 17

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 18 18 18

Scaling Interconnect for Many-Cores 19 Existing interconnects Buses, Crossbars, Rings Limited to ~16 cores Other s Interconnect proposals for Many-Cores Packet-switched, multi-hop, network-on-chip (NoC) Grid of routers meshes, tori and flattened butterfly Our Proposal Swizzle Switch Networks Flat single-stage, one-hop, crossbar++ interconnect 19 19

Mesh Network-on-Chip 2 Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller Memory Controller 13.8 mm 32 kb ICache 1.73 mm ARM Cortex A5 256 kb L2 Cache Bank 32 kb DCache R 1.73 mm Memory Controller Memory Controller Area = 3. mm 2 13.8 mm Area = 19 mm 2 (a) Router R 2 2

Flattened Butterfly Network-on-Chip 21 Memory Controller Memory Controller Memory Controller Memory Controller Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Router Memory Controller Memory Controller 13.8 mm Memory Controller Memory Controller 13.8 mm Area = 19 mm 2 21 21

Motivating Swizzle Switch Networks 22 Uniform access latency Ease of programming, data placement, thread placement,... Low Power Simplicity Packet-switched NoCs need routing, congestion management, flow control, wormhole switching,... 22 22

Motivating Swizzle Switch Networks 23 Mesh SSN Average'='.17'flits/node/sec' Unfairness'='42.2''' Average'='.13'flits/node/sec' Unfairness'='1.8''' accepted throughput [flits/node/cycle].6.5.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y).6.5.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y) Unfairness = Node highest_throughput / Node lowest_throughput Hotspot Traffic = All nodes sending data to node 8,8 Under Hotspot traffic, the Crossbar has a slightly less throughput than the Mesh but is 4x more fair. 23 23

Motivating Swizzle Switch Networks 24 Mesh Average'='.36'flits/node/sec' Unfairness'='1.89''' SSN Average'='.47'flits/node/sec' Unfairness'='1.2'''.6.6.5.5.4.4.3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y).3.2.1. 1 2 3 4 5 6 7 8 node index (X) 1 3 5 7 node index (Y) In the Mesh, nodes closest to the center receive the highest throughput Under Uniform Random traffic, the Crossbar has more throughput than the Mesh and is 87% more fair. 24 24

Motivating Swizzle Switch Networks 25 25 25

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 26 26 26

Top-Level Floorplan 27 Memory Controller Memory Controller Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 NE EN ES SE Memory Controller Memory Controller 14.3 mm 3.27 mm 512 kb L2 cache L2 Area = 4.5 mm 2 32 kb Icache.86 mm Arm Cortex A5 32 kb Dcache.86 mm 1.38 mm Memory Controller 14.3 mm Total Area = 24 mm 2 Memory Controller Core + L1 Area =.74 mm 2 27 27

Outline Swizzle Switch Circuit & Microarchitecture Overview Arbitration Prototype Swizzle Switch Cache Coherent Manycore Interconnect Motivation & Existing Interconnects Swizzle Switch Interconnect Evaluation 28 28 28

Evaluation 29 Simulation Parameters Feature NoC (Mesh/FBFly) SSN Processors L1 Cache 64 in-order cores, 1 IPC, 1.5 GHz 32kB I/D Caches, 4-way associative, 64-byte line size, 1 cycle latency L2 Cache Shared L2, 16 MB, 64-way banked, 8- way associative, 64-byte line size, 1 cycle latency Interconnect Main Memory 3. GHz, 128-bit, 4-stage Routers, 3 virt. networks w/ 3 virt. channels 496MB, 5 cycle latency Shared L2, 16MB, 32-way banked, 16-way associative, 64-byte line size, 11 cycle latency 1.5 GHz, 64x32x128bit Swizzle Switch Network Benchmarks SPLASH 2 : Scientific parallel application suite 29 29

Results Performance & QoS 3 Normalized Execution Time 1% 9% 8% 7% 6% 5% 4% 3% 2% 1% % Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Synchronization Stall Core Active Memory Stall Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Mesh FBFly SSN Barnes. Cholesky. FFT. FMM. Lu Contig. Lu NonContig. Ocean Contig. Ocean NonContig. Radix. Raytrace. Water NSquared. Water Spatial Overall Performance Quality-of-Service 3 3

Results Power 31 Interconnect Power (mw)! 7,! 6,! 5,! 4,! 3,! 2,! 1,!! Wire Dynamic! SSN/Router Dynamic! SSN/Router Leakage! Clock! Buffer! Barnes!.! Cholesky!.! FFT!.! FMM!.! Lu Contig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared!.! Water Spatial!.! Weighted Average! On average the SSN uses 28% less power in the interconnect compared to a flattened butterfly Total Benchmark Energy! (pj/inst)! 9! 8! 7! 6! 5! 4! 3! 2! 1!! Interconnect! L2 Dynamic! L2 Leakage! Core/L1 Dynamic! Core/L1 Leakage! Barnes!.! Cholesky!.! FFT!.! FMM!.! LuContig!.! Lu NonContig!.! Ocean Contig!.! Ocean NonContig!.! Radix!.! Raytrace!.! Water NSquared! Which results in an average reduction in total system energy to complete the task of 11%.! Water Spatial!.! Average! 31 31

Summary Swizzle Switch Prototype (45nm) 64x64 Crossbar with 128-bit busses Embedded LRG priority arbitration Achieved 4.4 Tbps @ ~6MHz consuming only 1.3W of power 32 Swizzle Switch Network Evaluation Improved performance by 21% Reduced power by 28% Reduced latency variability by 3x 32 32

33 Additional Detailed Slides 33 33

Arbitration Mechanism (Matrix View) 34 Pre-charge 128 Source IP Req Rel Source IP Req Rel Inhibits (X) Source IP Req Rel 128 Dest. IP Read Buffers Dest. IP Dest. IP Requests (R) X X 1 X 2 X 3 X 4 R X 1 1 R 1 X R 2 1 1 X 2 R 3 1 1 1 X 1 4 R 4 1 1 1 X 3 34 34

Least Recently Granted (LRG) 35 S et Reset X X 1 X 2 X 3 X 4 In X 1 1 In 1 X In 2 1 1 X 2 In 3 1 1 1 X 1 4 In 4 1 1 1 X 3 X X 1 X 2 X 3 X 4 In X 1 1 2 In 1 X 1 1 In 2 1 1 X 1 3 In 3 1 1 1 X 1 4 In 4 X 35 35

Round Robin Arbitration Output bus used as priority bus 36 Req l Rel l x SET Req m Rel m y 1 Req n Rel n z 1 1 RESET 36 36

Round Robin Arbitration Output bus used as priority bus 37 Req l Rel l Req m Rel m x + 1 y + 1 1 1 1 Increasing Round Robin Algorithm x y 62 z l m 62 n m l m-1 62 n Least priority upgraded Req n Rel n 37 37

QoS Arbitration 38 Increasing QoS Arbitration 3 3 3 2 QoS values LRG Arbitration 3 3 3 Highest QoS Winner 1 38 38

Timing Diagram 39 39 39

Crosspoint Circuit 4 wl out<> out<64> clock_b Crosspoint(x,y) Rel_b in<> in<64> wl -line bit clock Sense_ amp out<y> Reset_b Rel in<y> out<y> (priority_line) Rel Req wl Ack in<x+64> clock_b in<x> used to index Req/Rel out<y> sampled for Arbitration in<x+64> used for sending ack out<y> discharged for LRG update in<63> Rel_b wl out<63> out<127> in<127> 4 4

Regenerative Bit-line Repeater 41 16 16 Macro clock_b clock_b Bit-line Repeaters Bitline_R SSN Bitline_L clock Regeneration and Decoupling improves speed Bitline_L Bitline_R 41 41

Simulated bit-line delay improvement 42 Technology : 45nm Supply : 1.1V Temperature : 25 C 42 42

SSN Scaling: Simulation Technology : 45nm Supply : 1.1V Temperature : 25 C 43 wo Repeater 256 bit Bus with Repeater 128 bit Bus with Repeater Regenerative repeaters improve SSN scalability 43 43

Swizzle Switch Network-on-Chip 44 (1%)*%3, (1%)!%-./ ('%)!%-./ ('%)*%+, (1%)!%+, (1%)*%-./ ('%)*%-./ ('%)!%+, Memory Controller Memory Controller $ # 4 $ # $ $ 4 Memory Controller Memory Controller NW WN WS SW Swizzle Switch x 3 Memory Controller L1 14.3 mm NE EN ES SE Memory Controller Destination L2 Memory Controller Memory Controller 14.3 mm 1(%)*%-./ 1(%)!%-./ 1(%)*%+, 1%)*%+, 1%)!%-./ 1(%)!%+, 1%)!%+, 1%)*%-./ $ 4 $ $ # 4 # $ $!"#$%&$!'# ()*+'*"', -./12$ -./345 67+88 9 '%)!%-./ 1%)!%-./ 1%)*%3, 4 # $ 4!"#$%&$!"- ()*()*"', -./12 -./345 "6"88 9!"$:;*!'#$%&$!"# +'*()*"', -./12$-./345 67+88 9 $ $ # '%)!%+, $ # $ # 4 $ $ 4 '(%)*%+,% '(%)!%-./% '(%)*%-./!"#$%&& '(%)!%+, '%)!%+, '%)*%+, '%)*%-./ '%)!%-./ Source L1 L2 Shared Data Data Forwarding Responses Invalidations Requests Writebacks '%)*%-./ 1%)*%-./ 1%2!%+, '%)*%+,!"#$%&& (b) 44 44

Results 64-core with A9 O3 cores 45 45 45