TRIPS: A Distributed Explicit Data Graph Execution (EDGE) Microprocessor

Similar documents
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

TRIPS: Extending the Range of Programmable Processors

Creating a Scalable Microprocessor:

Copyright. Ramadass Nagarajan

Architecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R.

An Evaluation of the TRIPS Computer System (Extended Technical Report) 1

CS 426 Parallel Computing. Parallel Computing Platforms

Copyright. Haiming Liu

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University


Dynamic Vectorization in the E2 Dynamic Multicore Architecture

The Processor: Instruction-Level Parallelism

Multi-core Architectures. Dr. Yingwu Zhu

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

COMPUTER ORGANIZATION AND DESI

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Multi-core Architectures. Dr. Yingwu Zhu

Processor (IV) - advanced ILP. Hwansoo Han

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Outline Marquette University

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

KiloCore: A 32 nm 1000-Processor Array

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Getting to Work with OpenPiton. Princeton University. OpenPit

The Alpha and Microprocessors:

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Handout 2 ILP: Part B

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Simultaneous Multithreading on Pentium 4

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Characterization of High Performance DSP Kernels on the TRIPS Architecture

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

RAMP-White / FAST-MP

Advanced Instruction-Level Parallelism

Chapter 4 The Processor (Part 4)

Itanium 2 Processor Microarchitecture Overview

Multiple Instruction Issue. Superscalars

Getting to Work with OpenPiton

Instruction set architectures have long lifetimes

Vector IRAM: A Microprocessor Architecture for Media Processing

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Intel Enterprise Processors Technology

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Multithreaded Processors. Department of Electrical Engineering Stanford University

Microarchitecture Overview. Performance

Computer Architecture. What is it?

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Lec 25: Parallel Processors. Announcements

CS 152 Computer Architecture and Engineering

Hybrid Operand Communication for Dataflow Processors

Microarchitecture Overview. Performance

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

Portland State University ECE 588/688. Cray-1 and Cray T3E

Design and Implementation of the TRIPS Primary Memory System

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

ReFLEX: Block Atomic Execution on Conventional ISA Cores

Introducing the Cray XMT. Petr Konecny May 4 th 2007

MIT Laboratory for Computer Science

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Copyright 2012, Elsevier Inc. All rights reserved.

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Parallel Computing: Parallel Architectures Jin, Hai

An Accurate Architectural Simulator for ARM1136

DDM-CMP: Data-Driven Multithreading on a Chip Multiprocessor

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

Jim Keller. Digital Equipment Corp. Hudson MA

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

A Streaming Multi-Threaded Model

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel released new technology call P6P

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Modern Processor Architectures. L25: Modern Compiler Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

PACE: Power-Aware Computing Engines

CSE 392/CS 378: High-performance Computing - Principles and Practice

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

Adaptive Scientific Software Libraries

Initial Performance Evaluation of the Cray SeaStar Interconnect

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Constructing Virtual Architectures on Tiled Processors. David Wentzlaff Anant Agarwal MIT

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

EECS4201 Computer Architecture

Introduction to CELL B.E. and GPU Programming. Agenda

Advanced Processor Architecture

Transcription:

TRIPS: A Distributed Explicit Data Graph Execution (EDGE) Microprocessor Madhu Saravana Sibi Govindan, Doug Burger, Steve Keckler, and the TRIPS Team Hot Chips 19, August 2007 Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/trips Recent Trends Scaling challenges for conventional superscalar processors Power and pipeline limits impede clock rate growth Wire delays cause overheads for concurrency Complexity of large monolithic architectures Industry shift to multicore architectures Work well for certain types of workloads Big challenges: how to program/amdahl s law Single-thread performance is still important Hot Chips 19, August 2007 2

TRIPS A Technology Scalable Architecture Goals High single-thread performance through ILP Exploit concurrency at multiple granularities Scalable with technology trends Key technologies Explicit data graph execution (EDGE) ISA Distributed processor microarchitecture Distributed non-uniform (NUCA) L2 cache Tiled and networked design Hot Chips 2005 - Chip in RTL design phase Hot Chips 2007 - Chip/system complete Manufacturing/bring-up complete Performance tuning in progress This talk focuses on chip and system implementation Hot Chips 19, August 2007 3 Outline Explicit Data Graph Execution (EDGE) ISAs ISA support for distributed execution TRIPS networked microarchitecture Processor microarchitecture Non-uniform cache memory system TRIPS chip and system implementation Custom ASIC chip and system boards Preliminary performance results Hot Chips 19, August 2007 4

Explicit Data Graph Execution (EDGE) Two key features Program broken into sequence of instruction blocks Execution model: fetch, execute, and commit blocks atomically Amortize overheads over many instructions Within a block, instructions explicitly encode producer-consumer communication Producer instructions explicitly targets consumers and send results directly to them (without using a register file) Instructions fire when all operands arrive Any instruction may be predicated (conditionally executed) TRIPS blocks - up to 128 instructions Compile-time techniques to create large blocks Average block sizes typically greater than 45 instructions Long-term goal: hide hard-to-predict branches inside blocks Hot Chips 19, August 2007 5 TRIPS Execution Model Program Control Flow Graph (CFG) Basic Block if-then-else example if (p==0) z = a * 2 + 3; else z = b * 3 + 4; TRIPS block TRIPS block for if-then-else a p teqz b TRIPS block muli muli addi addi st(1) CFG Loop Example if-then-else Example Hot Chips 19, August 2007 6

Block 0 TRIPS Execution Model Time (cycles) 0 5 10 30 40 B i FETCH Block 1 B i+1 EXECUTE (variable execution time) Block 2 B i+2 Block 3 B i+3 COMMIT Block 8 B i+8 Structural dependence In-flight blocks/instructions Block 4 B i+4 Block 5 B i+5 Block 6 B i+6 Block 7 B i+7 Fetch/Execute/Commit overlapped across multiple blocks Can execute up to 8 blocks at a time via speculation Exposes concurrency with a very large instruction window Single-threaded mode with up to 1024 instructions Simultaneous multi-threaded (SMT) mode with up to 4 threads and 256 instruction window per thread Hot Chips 19, August 2007 7 TRIPS Prototype Chip 2 TRIPS Processors NUCA L2 Cache 1 MB, 16 banks On-Chip Network (OCN) 2D mesh network Replaces on-chip bus Controllers 2 DDR SDRAM controllers 2 DMA controllers External Bus Controller (EBC) Interfaces with PowerPC 440GP (control processor) Chip-to-Chip (C2C) network controller Clocking 2 PLLs 4 Clock domains 1x and 2x SDRAM Main and C2C Clock tree Main domain has 4 quadrants to limit local skew DMA DMA EBI (External Bus Interface) DDR SDRAM 108 SDC OCN NUCA L2 Cache SDC EBC C2C 108 8x39 DDR SDRAM C2C Links TEST PROC 0 PROC 1 PLLS Hot Chips 19, August 2007 8 44 IRQ (Interrupt Request) GPIO 16 JTAG CLK

TRIPS Microarchitecture Principles Distributed and tiled architecture Small and simple tiles (register file, data cache bank, etc.) Short local wires Tiles are small: 2-5 mm 2 per tile is typical No centralized resources Networks connect the tiles Networks implement distributed protocols (I-fetch, bypass, etc.) Includes well-defined control and data networks Networks connect only nearest neighbors No global wires Design modularity and scalability Design productivity by replicating tiles (design reuse) Networks extensible, even late in design cycle Hot Chips 19, August 2007 9 TRIPS Tile-level Microarchitecture TRIPS Tiles G:Processor control - TLB w/ variable size pages, dispatch, next block predict, commit R: Register file - 32 registers x 4 threads, register forwarding I: Instruction cache - 16KB storage per tile D: Data cache - 8KB per tile, 256-entry load/store queue, TLB E: Execution unit - Int/FP ALUs, 64 reservation stations M: Memory - 64KB, configurable as L2 cache or scratchpad N: OCN network interface - router, translation tables DMA: Direct memory access controller SDC: DDR SDRAM controller EBC: External bus controller - interface to external PowerPC C2C: Chip-to-chip network controller - 4 links to XY neighbors Hot Chips 19, August 2007 10

TRIPS Processor EDGE ISA blocks mapped to array of tiles Compile-time scheduler decides where (not when) instructions execute TRIPS: aggressive processor capabilities Up to 16 instructions per cycle Up to 4 loads/stores per cycle Up to 64 outstanding L1 data cache misses Up to 1024 dynamically scheduled instructions Up to 4 simultaneous multithreading (SMT) threads Memory system 4 simultaneous L1 cache fills per processor Up to 16 simultaneous L2 cache accesses Up to 16 outstanding L2 cache misses Hot Chips 19, August 2007 11 Non-Uniform (NUCA) L2 Cache 1MB L2 cache Sixteen tiled 64KB banks On-chip network 4x10 2D mesh topology 128-bit links, 366MHz (4.7GB/sec) 4 virtual channels prevent deadlocks Requests and replies are wormholerouted across the network Up to 10 memory requests per cycle Up to 128 bytes per cycle returned to the processors Individual banks reconfigurable as scratchpad PROC 0 PROC 1 Request Reply Hot Chips 19, August 2007 12

TRIPS Project Timeline 1999 2000 2001 2002 2003 2004 2005 2006 2007 Technology trend analysis What are the problems to solve? New architecture concepts, invention, publishing What are the solutions? UT-Austin team 12 graduate students + 1 engineer RTL, verification, timing IBM ASIC team Physical design Prototype implementation Is it feasible? Prototype design Solve detailed challenges of architecture. Prototype testing. Does it work; how well? Unveiling Hot Chips 19, August 2007 13 TRIPS Chip Implementation Process Technology Die Size Package Pin Count # of placed cells Transistor count (est.) # of routed nets Total wire length Power (measured) Clock period 130nm ASIC with 7 metal layers 18.3mm x 18.37mm (336 mm 2 ) 47mm x 47mm BGA 626 signals, 352 Vdd, 348 GND 6.1 million 170 million 6.5 million 1.06 km 36W at 366MHz (chip has no power mgt.) 2.7ns (actual) 4.5ns (worse case sim) Hot Chips 19, August 2007 14

Die Photos With C4 Array Without C4 Array Hot Chips 19, August 2007 15 Benefits of Tiled Design Design modularity 11 different tiles, instantiated a total of 106 times Clean interfaces at tile boundaries Verification - no hardware bugs Tiles verified extensively before stitching together Place and Route - hierarchical in nature Wiring only between nearest neighbors But - each physical instance was a little different Timing - trivial at top level, communication planned No global wires or timing paths Hot Chips 19, August 2007 16

TRIPS Daughtercard Voltage Regulator Heatsink/Fan (rated to 90 W) 2 x 1GB DIMMs 12V to daughtercard, stepped down to 1.5V, 2.5V, and 3.3V 12 GFlops peak 45 W total (worst case) Hot Chips 19, August 2007 17 TRIPS Motherboard 1 motherboard includes: 4 daughter-boards 4 TRIPS chips 8 GBytes DRAM PowerPC 440GP control processor I/O: ethernet, serial, C2C links FPGA I/O interface Peak performance 48 GFlops at 366 MHz 180 Watts Hot Chips 19, August 2007 18

TRIPS Multi-board System Extensible to 8 boards (64 processors) Micro-coax cables extend C2C network across boards Peak performance 380 GFlops, 1.4KW Parallel message-passing software Hot Chips 19, August 2007 19 TRIPS Prototype System Layers Ethernet Switch Board 0 EBC 0 EBC 2 PowerPC EBI 440GP IRQ HOST PC x86 Linux EBC EBC P1 3 TRIPS Resource Manager (TRM) File system Runtime services Login/debug/etc. Board 1 Board 2 Local Resoure Manager (LRM) listens to HostPC Runs embedded Linux PowerPC EBI device driver to control TRIPS chips PowerPC EBITRIPS EBC Runs TRIPS apps Interrupts PowerPC if necessary System calls, exceptions Hot Chips 19, August 2007 20

Preliminary Performance (HW) Challenges Different technology and ISAs Different processor-to-memory clock ratio TRIPS compiler fine-tuning in progress Cycle-to-cycle comparison on multiple HW platforms TRIPS Performance counters PAPI - Performance API on Linux systems for others Applications Compiled + hand-optimized Mix of kernels and full algorithms Compiled only The Embedded Microprocessor Benchmark Consortium (EEMBC) Versabench (MIT) SPEC benchmarks in progress Processor TRIPS Core 2 Pentium 4 Pentium 3 Clock Speed 366 MHz 1.6 GHz (underclocked) 3.6 GHz 450 MHz Memory Speed 200 MHz DDR 800 MHz DDR2 533 MHz DDR2 100 MHz SDRAM Process Technology 130 nm 65 nm 90 nm 250 nm Hot Chips 19, August 2007 21 TRIPS vs. Conventional Processors: Kernels 8.00 Speedup Relative to Core 2 (cycles) 7.00 6.00 5.00 4.00 3.00 2.00 TRIPS-Hand optimized TRIPS-Compiled Pentium 3 Pentium 4 1.00 Core 2 0.00 802.11a 8b10b a2time rspeed vadd conv matrix autocor ct fmradio Geometric Mean BIT INT FLOAT STREAM Hot Chips 19, August 2007 22

TRIPS vs. Conventional Processors EEMBC and signal processing (compiled) 2.5 TRIPS-Compiled Pentium 3 Speedup Relative to Core 2 (cycles) 2 1.5 1 0.5 Pentium 4 Core 2 0 aifftr aiifft basefp cacheb iirflt matrix pntrch puwmod rspeed ospf bezier text autocor Graph shows representative subset of 33 benchmarks and geometric mean for all 33 benchmarks fft conv ct Geometric Mean Hot Chips 19, August 2007 23 Summary TRIPS prototype demonstrates feasibility of: Explicit Data Graph Execution (EDGE) ISAs Distributed processor and memory microarchitectures Scaled and tiled uniprocessors Non-uniform cache architectures (NUCA) Recompilation required, but no change to source code Performance is promising First generation TRIPS is 3.1x faster for hand-optimized code Compiled-code reasonable and improving Identified several opportunities for improvements Enhanced operand networks Dynamic aggregation of processor cores TRIPS-like architectures are a potential alternative to multicore Hot Chips 19, August 2007 24

Acknowledgements TRIPS Hardware Team Raj Desikan, Saurabh Drolia, Madhu Sibi Govindan, Divya Gulati, Paul Gratz, Heather Hanson, Changkyu Kim, Haiming Liu, Robert McDonald, Ramdas Nagarajan, Nitya Ranganathan, Karu Sankaralingam, Simha Sethumadhavan, Premkishore Shivakumar TRIPS Software Team Kathryn McKinley, Jim Burrill, Xia Chen, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Suriya Narayanan, Sadia Sharif, Aaron Smith, Bill Yoder IBM Microelectronics Austin ASIC Group TRIPS Sponsors DARPA Polymorphous Computing Architectures Air Force Research Laboratories National Science Foundation IBM, Intel, Sun Microsystems Hot Chips 19, August 2007 25