Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it OPTICS Workshop, Grenoble, 13/3/2015 Acknowledgement: Paolo Grani, Davide Bertozzi, Luca Ramini,

Outline
- Introduction
- Ring-based optical interconnection trade-offs
- Fast path setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions

Introduction and motivation: processors (1)
Nowadays processors are parallel; the biggest reason was the emergence of wire-delay issues, i.e. on-chip latency.
- Pentium 4 (1 core), Core Duo (2), i7 980X (6), i7 5960X / AMD FX-8370 (8)
- Beyond about 10 cores: tiled designs, e.g. Tilera TILE64 (64), Intel Polaris (80)

Introduction and motivation: processors (2)
...when processor A wants to talk to processor B. The shared-memory model is here to stay for some more time, at least within core clusters:
- Ease of programmability
- Scalable directory-based coherence
- Numerous messages are exchanged for each load/store miss in a MOESI read-miss or write-miss transaction, e.g. about 80% control messages (8 bytes) and only 20% data messages (64 bytes) (worked example below)
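
To see what such a mix implies at the byte level, here is a back-of-the-envelope sketch (the 80%/20% split and message sizes are the figures quoted above; everything else is illustrative):

```python
# Hedged sketch: byte-level view of the assumed coherence-message mix.
# With 80% control messages of 8 bytes and 20% data messages of 64 bytes,
# most messages are small even though most *bytes* travel in data packets,
# which is why latency (not raw bandwidth) dominates for coherence traffic.
ctrl_frac, data_frac = 0.80, 0.20
ctrl_bytes, data_bytes = 8, 64

avg_msg_size = ctrl_frac * ctrl_bytes + data_frac * data_bytes   # 19.2 bytes
ctrl_byte_share = (ctrl_frac * ctrl_bytes) / avg_msg_size        # ~33% of bytes
print(f"average message size: {avg_msg_size:.1f} B, "
      f"control share of bytes: {ctrl_byte_share:.0%}")
```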

Introduction and motivation: coherence traffic
[Figure: total, control-only (Ctrl) and data-only (Data) traffic statistics for 16-core and 64-core configurations on an ideal network]
The average load is low but bursty, so message latency is critical.

Introduction and motivation: integrated photonics (1)
Optical communication technology can now be integrated in a CMOS process.
Pros:
- Fast propagation (16 ps/mm), about 10x less latency than electrical wires (e.g. at 22 nm [1]); e.g. a couple of cycles at 3 GHz to cross a 2 cm chip corner to corner (quick check below)
- Compatible with CMOS fabrication
- High bandwidth: 10-40 GHz modulation rates plus WDM (on the order of 1 Tbps)
- End-to-end: energy consumption is almost insensitive to distance
Cons:
- End-to-end, no store-and-forward: a lot of networking know-how must be thrown away
- Active components (lasers, photodetectors) still have some integration problems
- Can have, or induce, significant static power consumption
[1] Shacham et al., "Photonic Networks-on-Chip for Future Generations of Chip Multiprocessors", IEEE Trans. Computers, 2008
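
A quick sanity check of the corner-to-corner figure (a sketch: the 16 ps/mm delay and 3 GHz clock are quoted above, while the 2 cm x 2 cm die and the diagonal waveguide path are assumptions, and modulation/detection overheads are ignored):

```python
# Sanity-check sketch of the corner-to-corner latency claim above.
# Assumes 16 ps/mm optical propagation, a 2 cm x 2 cm die crossed along its
# diagonal, and a 3 GHz core clock; E/O and O/E conversion are not included.
import math

delay_ps_per_mm = 16.0
diagonal_mm = math.hypot(20.0, 20.0)        # ~28.3 mm corner to corner
clock_ghz = 3.0

flight_ps = delay_ps_per_mm * diagonal_mm   # ~450 ps time of flight
cycle_ps = 1000.0 / clock_ghz               # ~333 ps per cycle
print(f"{flight_ps:.0f} ps ≈ {flight_ps / cycle_ps:.1f} cycles at {clock_ghz} GHz")
```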

Introduction and motivation: integrated photonics (2)
Integrated photonics is still in its infancy when it comes to serving the local requirements of computer systems:
- Traffic close to the cores is very different from the aggregated traffic at wider scales (e.g. blade/rack/datacenter)
- The layered design is not yet consolidated enough
- Application-, architecture- and network-level requirements and choices are not orthogonal to optical design choices (topology, access schemes, resource provisioning, DWDM, technological choices); their interactions can induce very different performance and power consumption in the optical network
Hence the need for an integrated, multi-layer approach: effective designs, and exploration and consolidation of best practices.

Outline
- Introduction
- Ring-based optical interconnection trade-offs
- Fast path setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions

Ring-based ONoC trade-offs
Hybrid electronic/optical on-chip networks, or all-optical ones; the optical network is based on a ring logical topology, whose simplicity makes it a good reference design point.
We analyze the relationship between optical resource provisioning (waveguides) and core count, versus:
- The share of traffic offloaded to the optical network: (one ring) only read requests, invalidations and invalidation acks; (multiple rings) all traffic
- MWSR and MWMR access schemes
considering both performance and power metrics.
[Grani, Bartolini, "A Simple On-chip Optical Interconnection for Improving Performance of Coherency Traffic in CMPs", DSD, 2012]
[Grani, Bartolini, "Design Options for Optical Ring Interconnect in Future Client Devices", ACM JETC, 2014]

Simulation environment and methodology
gem5 simulator in full-system mode (Linux 2.6.27 booted in gem5), with 8/16/32/64 cores running the PARSEC 2.1 multi-threaded benchmark suite. We forced core affinity for the application threads to avoid non-determinism due to OS scheduling.

Architectural parameters:
- Cores: 8/16/32 cores (64-bit), 4 GHz
- L1 caches: 16 kB (I) + 16 kB (D), 2-way, 1-cycle hit time
- L2 cache: 16 MB, 8-way, shared and distributed over 8/16/32 banks, 3/12 cycles tag/tag+data
- Directory: MOESI protocol, 8/16/32 slices, 3 cycles
- ENoC: 2D mesh/torus, 4 GHz, 4/5 cycles/hop, 32 nm, 1 V, 64/128 bits/flit
- Optical ring: 3D-stacked, 1-9 parallel waveguides, 30 mm length, 8/16/32 I/O ports, 10 GHz, 64/70 wavelengths (16- and 32-core), 460 ps full round trip
- Main memory: 4 GB, 300 cycles

Results: multiple rings and all traffic
No electrical NoC: searching for good design points, from low-bandwidth configurations up to 8/9 rings of 64 wavelengths each.
- At the low-bandwidth end, the bandwidth per endpoint is too low (1/2/4 bits), and each message transmission suffers avalanche conflicts and long serialization (sketch below)
- Best trade-offs: 8-core: MWMR with 1 ring, MWSR with 2 rings; 16-core: MWMR with 1 or 2 rings, MWSR with 4 rings
[Figure: 8-core and 16-core results vs. number of rings, with annotations marking saturation and waste of optical resources]
[Grani, Bartolini, "Design Options for Optical Ring Interconnect in Future Client Devices", ACM JETC, 2014]
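
A rough even-share model of per-endpoint bandwidth on a shared ring (a sketch with assumed wavelength counts, not the paper's exact accounting; the 1/2/4-bit figures above refer to the narrowest configurations explored):

```python
# Hedged sketch: even-share bandwidth per endpoint on a single shared ring.
# Assumes 1 bit per wavelength per cycle split evenly among endpoints; token
# arbitration and conflicts in MWSR/MWMR make the real share lower still.
def bits_per_endpoint_per_cycle(wavelengths: int, endpoints: int) -> float:
    return wavelengths / endpoints

# A narrow ring shared by many cores leaves only a few bits per cycle each,
# so even an 8-byte control message needs many cycles of serialization.
for wavelengths in (16, 32, 64):
    print(f"{wavelengths} wavelengths, 16 cores -> "
          f"{bits_per_endpoint_per_cycle(wavelengths, 16)} bits/cycle per endpoint")
```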

Results: energy
Sensitivity to the number of photonic rings for MWMR and MWSR.
- As the ring count increases, the overall optical NoC power increases: more MRRs, higher insertion loss (crossings, splitting, ...) and hence higher laser power; still, some topology assumptions can make the difference between MWMR and MWSR
- Static power is extremely high for MWMR with multiple rings (8-core and 16-core); static consumption of the MWSR ring also increases at 8 rings, partly due to the token waveguide
Overall, MWSR is the best choice for supporting all traffic.
[Grani, Bartolini, "Design Options for Optical Ring Interconnect in Future Client Devices", ACM JETC, 2014]
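
To make the "more rings -> more laser power" chain concrete, here is a toy optical link-budget sketch (all loss and sensitivity values are placeholders, and the way MRR/crossing counts grow with the ring count is an assumption of this model, not data from the paper):

```python
# Hedged link-budget sketch: why extra MRRs and crossings drive laser power up.
# Placeholder device losses; real numbers depend on the photonic device library.
def laser_power_mw(n_through_mrrs, n_crossings, detector_sensitivity_dbm=-20.0,
                   loss_per_mrr_db=0.01, loss_per_crossing_db=0.05,
                   waveguide_loss_db=1.0):
    total_loss_db = (n_through_mrrs * loss_per_mrr_db
                     + n_crossings * loss_per_crossing_db
                     + waveguide_loss_db)
    required_dbm = detector_sensitivity_dbm + total_loss_db   # worst-case path budget
    return 10 ** (required_dbm / 10.0)                        # dBm -> mW per wavelength

# In this toy model each added ring adds pass-by MRRs and crossings to the worst path.
for rings in (1, 2, 4, 8):
    print(rings, "ring(s) ->", round(laser_power_mw(rings * 64, rings * 4), 4),
          "mW per wavelength")
```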

Outline
- Introduction
- Ring-based optical interconnection trade-offs
- Fast path setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions

Fast path setup for switched optical networks
Switched optical networks can provide potentially higher scalability than ring (crossbar) based approaches, but they require broadband optical switches and suffer from sequential path-setup time [1]:
- High overhead for large endpoint counts and for small message sizes, where "small" can mean less than thousands of bytes! (latency sketch below)
- Coherence traffic is therefore out of the game with serial path setup
[1] Shacham et al., "Photonic Networks-on-Chip for Future Generations of Chip Multiprocessors", IEEE Trans. Computers, 2008
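
A rough latency comparison (a sketch with assumed per-hop setup cost, hop count and line rate, not figures from [1]) showing why serial setup is only amortized by kilobyte-scale messages:

```python
# Hedged sketch: serial path setup vs. optical transmission time.
# Assumes the setup request is forwarded hop by hop, reserving one optical
# switch per hop, before the payload is sent; all constants are illustrative.
def serial_setup_cycles(hops: int, cycles_per_hop: int = 5) -> int:
    return hops * cycles_per_hop                  # setup grows with path length

def transmission_cycles(message_bytes: int, bits_per_cycle: int = 64) -> int:
    return (message_bytes * 8 + bits_per_cycle - 1) // bits_per_cycle

for msg in (8, 64, 2048):                          # control msg, cache line, long message
    setup = serial_setup_cycles(hops=8)
    tx = transmission_cycles(msg)
    print(f"{msg:5d} B message: setup {setup} cycles vs. transmission {tx} cycles")
```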

Fast path setup for switched optical networks
We propose a centralized arbiter that can configure all the required optical switches simultaneously through a wavelength-routed optical ring:
- The network from the cores to the arbiter is itself optical, ring based and wavelength routed
- The topology of the path-setup network is thus decoupled from that of the data network
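
A minimal sketch of how such a centralized arbiter could behave (hypothetical data structures and names, not the actual arbitration policy of the proposed design): it grants a request only if no segment of the path is busy, and then configures every switch on the path in the same round.

```python
# Hedged sketch of a centralized path-setup arbiter for a switched ONoC.
# Hypothetical model: each (src, dst) pair maps to a precomputed list of
# (switch, port) segments; a path is granted atomically or not at all.
class PathSetupArbiter:
    def __init__(self, routes):
        self.routes = routes          # (src, dst) -> list of (switch_id, port) segments
        self.busy = set()             # segments currently reserved by granted paths

    def request(self, src, dst):
        path = self.routes[(src, dst)]
        if any(seg in self.busy for seg in path):
            return None               # sub-path conflict: requester retries later
        self.busy.update(path)        # reserve the whole path atomically
        return path                   # all switches can now be configured at once

    def release(self, path):
        self.busy.difference_update(path)

# usage sketch: two requests sharing a sub-path
routes = {(0, 3): [("sw0", "E"), ("sw1", "E"), ("sw2", "N")],
          (1, 3): [("sw1", "E"), ("sw2", "N")]}
arb = PathSetupArbiter(routes)
granted = arb.request(0, 3)           # granted: sw0, sw1, sw2 configured together
print(arb.request(1, 3))              # None: conflicts with the granted path on sw1/sw2
arb.release(granted)
```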

Results: single arbiter

Average path-setup latency [cycles]:
           Serial PS   Simultaneous PS
  8-core     51.26         25.94
  16-core    70.37         27.45
  32-core   156.18         65.60

- Path-setup latency: a dramatic average reduction, even though conceptually the arbiter serializes path-setup requests
- The arbiter works well for the 8- and 16-core setups; in the 32-core case it induces a 25% slowdown due to path-setup serialization
- In all cases the arbiter performs much better than serial path setup: serial PS prevents cache-coherent traffic from working at all, while the arbiter supports it

Scalable fast setup: multiple arbiters
[Figure: PM2 multi-arbiter, 4-cluster configurations for 16, 32 and 64 cores]
- One arbiter per cluster, with an optical ring between clusters
- Independent path setups within clusters, coordinated inter-cluster path setup
- Two-level MOESI to increase the share of intra-cluster paths
- 64-core configurations: PM2, PM3, PM4

Outline
- Introduction
- Ring-based optical interconnection trade-offs
- Fast path setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions

Software restructuring for matching ONoC features
On-chip optical networks can potentially mitigate wire-delay issues in tiled CMPs: distant cores (i.e. many hops away) can actually be reached in a few cycles, so access time to the distributed cache resources becomes more uniform in Non-Uniform Cache Access (NUCA) architectures (e.g. tiled CMPs).
Software restructuring techniques for locality typically aim at putting data close to where it is used (the cores) so as to reduce access time in NUCA caches: a delicate balance between the ideal access barycenter of the source cores and the conflict (miss) penalties in the zones that must become congested to keep NoC access overhead low.

Software restructuring for matching ONoC features (2)
With ONoCs we can, on average, afford to use far more of the cache: data can be spread much more widely across the chip to reduce conflicts (misses), since the distance overhead is almost constant (hypothesis: using our arbitrated optical switched network).
Not exactly straightforward, though: with switched networks, conflicts arise not only from data placement but also from message paths and, specifically, sub-paths! So we need to:
- Spread data to gain from reduced conflict misses
- But (!) not too far from the access barycenter, otherwise the average path length increases and the path-setup conflict probability increases (a big overhead!)
Each VM page is first positioned at its access barycenter, then candidate positions within a radius of H hops are tried, looking for the minimum of an access cost function that accounts for misses, path length and conflicts, and cache, memory and ONoC access times (see the sketch after this slide).
[Grani, Bartolini, Frediani, Ramini, Bertozzi, "Integrated Cross-Layer Solutions for Enabling Silicon Photonics into Future Chip Multiprocessors", IEEE IMS3TW, 2014]
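
A minimal sketch of the barycenter-plus-radius placement search described above (the structure follows the slide, but the cost weights, the miss/conflict estimators and all function names are hypothetical placeholders, not the IMS3TW cost model):

```python
# Hedged sketch of the VM-page placement search: start at the access barycenter
# and probe tiles within H hops, scoring each with a placeholder cost that mixes
# conflict-miss pressure, average path length and expected path-setup conflicts.
def hops(a, b, mesh_width):
    ax, ay = a % mesh_width, a // mesh_width
    bx, by = b % mesh_width, b // mesh_width
    return abs(ax - bx) + abs(ay - by)

def place_page(accessing_cores, load_per_tile, mesh_width, radius_h,
               w_miss=1.0, w_path=0.5, w_setup=2.0):
    # 1) access barycenter of the cores that touch the page
    bx = round(sum(c % mesh_width for c in accessing_cores) / len(accessing_cores))
    by = round(sum(c // mesh_width for c in accessing_cores) / len(accessing_cores))
    barycenter = by * mesh_width + bx

    # 2) try every tile within H hops of the barycenter, keep the cheapest
    best_tile, best_cost = barycenter, float("inf")
    for tile in range(mesh_width * mesh_width):
        if hops(tile, barycenter, mesh_width) > radius_h:
            continue
        avg_path = sum(hops(c, tile, mesh_width)
                       for c in accessing_cores) / len(accessing_cores)
        miss_cost = load_per_tile.get(tile, 0)                # proxy for conflict misses
        setup_cost = avg_path * load_per_tile.get(tile, 0)    # proxy for sub-path conflicts
        cost = w_miss * miss_cost + w_path * avg_path + w_setup * setup_cost
        if cost < best_cost:
            best_tile, best_cost = tile, cost
    return best_tile

# usage sketch: 4x4 tiled CMP, a page touched by cores 0 and 5, tile 1 already loaded
print(place_page([0, 5], {1: 10}, mesh_width=4, radius_h=2))
```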

Results: software restructuring for a switched ONoC (16-core, PM2)
- Software restructuring gains 19% more speedup over the standard arbiter solution (7% over the baseline)
- The electronic baseline cannot benefit from this software restructuring because of its heavily non-uniform access time to the tiles

Outline
- Introduction
- Ring-based optical interconnection trade-offs
- Fast path setup for switched optical networks
- Software restructuring for matching ONoC features
- Conclusions

Conclusions
- Integrated optics is a breakthrough technology that brings a number of positive facets, and improvements in devices will make its prospects even better
- Its technological discontinuity needs to be integrated with patience and with a deep, vertical design approach in modern processor and computer design, to master and exploit the two-way, inter-related effects that exist between layers; otherwise there is a significant risk of sub-optimal designs, possibly even worse than well-tuned electronic solutions
- To reach effective designs, the opportunities and constraints of the different layers must meet in the middle

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Thanks for your attention! Q & A