Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC-

OpenSPARC T1
The T1 is a new-from-the-ground-up SPARC microprocessor implementation that conforms to the UltraSPARC Architecture 2005 specification and executes the full SPARC V9 instruction set.
Sun has produced two previous multicore processors, UltraSPARC IV and UltraSPARC IV+, but UltraSPARC T1 is its first microprocessor that is both multicore and multithreaded.
The processor is available with 4, 6 or 8 CPU cores, each core able to handle four threads, so it can process up to 32 threads concurrently.
Designed to lower the energy consumption of server computers, the 8-core CPU typically uses 72 W of power at 1.2 GHz.
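The thread and power figures above can be turned into a quick per-thread budget. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 72 W figure is the typical power quoted for the 8-core part, not a measured per-thread value.

```python
# Figures taken from the slide: 4 threads per core, 72 W typical for 8 cores.
THREADS_PER_CORE = 4


def hardware_threads(cores):
    """Number of hardware threads for a T1 with the given core count."""
    return cores * THREADS_PER_CORE


def watts_per_thread(total_watts, cores):
    """Naive power budget per hardware thread (assumes an even split)."""
    return total_watts / hardware_threads(cores)


if __name__ == "__main__":
    for cores in (4, 6, 8):
        print(cores, "cores ->", hardware_threads(cores), "threads")
    print("8-core part:", watts_per_thread(72, 8), "W per thread")
```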

72 W, 1.2 GHz, 90 nm
A cutting-edge design, targeted at high-end servers.
NOT FOR SPACE USE
But let's see what the potential spin-ins are.

Why OPEN?
On March 21, 2006, Sun made the UltraSPARC T1 processor design available under the GNU General Public License. The published information includes:
  Verilog source code of the UltraSPARC T1 design, including verification suite and simulation models
  ISA specification (UltraSPARC Architecture 2005)
  The Solaris 10 OS simulation images
  Diagnostic tests for OpenSPARC T1
  Scripts, open source and Sun internal tools needed to simulate and synthesize the design
  Scripts and documentation to help with FPGA implementation of parts of the OpenSPARC T1 design, including the SPARC core, floating-point unit and crossbar

Target market: driving requirements
Walls facing server CPUs:
  Memory latency (75%!), load-load dependencies? Fine multithreading (1-cycle context switch, 640-entry 64-bit register file, round robin)
  Branch prediction? Minimal (no BPU, OOO buffers, pipes, bypass)
  Cache coherency overhead? Per-core L1 I&D caches (3 cycles) connected to shared L2 (MESI, 12-way associative, 64 B line) and memory (8 cycles) over a 200 GB/s interconnect (SMP on chip, dual Opteron)
  Power consumption? Throttling of threads/cores; lower temperature increases reliability due to lower wearout and drift
What are they selling? Many customers invested in Solaris, thread-rich workloads (Java), modularity given per-core (Oracle) / per-chip (Microsoft) licensing
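To illustrate why fine multithreading with a round-robin thread select helps against the memory wall, here is a toy cycle-level model. It is a sketch under stated assumptions, not a model of the real T1 pipeline: the function name, the miss rate (one stalling load every 4 instructions) and the stall length (8 cycles) are made-up illustration parameters.

```python
def simulate(num_threads, cycles=10_000, miss_every=4, miss_latency=8):
    """Return the fraction of cycles in which the core issued an instruction.

    Each cycle the core issues from the next ready thread in round-robin
    order (zero-cost switch). Every `miss_every`-th instruction of a thread
    is treated as a load miss that stalls only that thread.
    All parameters are hypothetical, chosen for illustration.
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue
    issued_by = [0] * num_threads  # instructions issued so far, per thread
    last = -1                      # thread that issued most recently
    busy = 0
    for cycle in range(cycles):
        # Round robin: start searching just after the last issuing thread.
        for offset in range(1, num_threads + 1):
            t = (last + offset) % num_threads
            if ready_at[t] <= cycle:
                issued_by[t] += 1
                busy += 1
                last = t
                if issued_by[t] % miss_every == 0:
                    # This thread waits for memory; the others keep going.
                    ready_at[t] = cycle + 1 + miss_latency
                break
        # If no thread was ready, the cycle is simply lost.
    return busy / cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} thread(s): pipeline utilisation {simulate(n):.2f}")
```

With these made-up parameters a single thread keeps the pipeline busy only about one cycle in three, while four threads fill most of the stall cycles; the exact numbers depend entirely on the assumed miss rate and latency.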

Throughput vs. Utilisation
T1: 5-stage (RISC2) pipeline plus Thread Select stage, < 1 CPI -> 9.7 GOPS
Superpipelining (PPC970) faces the memory wall and Instruction Level Parallelism limits
Large-scale parallel (Beowulf) faces the power wall -> smaller clusters
Superscalar (dynamic) cores are limited by the complexity required to find parallel instructions, which is roughly proportional to the square of the number of instructions that can be issued simultaneously
Superscalar cores improve resource usage with coarse/event (IBM Northstar), fine, or simultaneous multithreading (Intel P4 Hyper-Threading, IBM POWER5): higher IPC; a 2-3x increase in clock speed allows two dual-core or one quad-core "fatter but fewer" CPUs to approach T1
Weaknesses of T1: shallow pipeline, clock speed, large SPARC-defined register window, no FPU
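The quadratic-complexity argument can be made concrete with a small comparison. This is a rough sketch that takes the "proportional to the square of the issue width" rule from the slide at face value; the cost unit is arbitrary and the function name is hypothetical.

```python
def issue_logic_cost(num_cores, issue_width):
    """Relative cost of issue/dependency-check logic for a whole chip.

    Assumes cost per core grows as issue_width squared (the rule of thumb
    quoted on the slide) and that cores are independent of each other.
    """
    return num_cores * issue_width ** 2


if __name__ == "__main__":
    # Three ways of providing 8 issue slots per cycle.
    for cores, width in ((1, 8), (2, 4), (8, 1)):
        cost = issue_logic_cost(cores, width)
        print(f"{cores} core(s) x {width}-issue: relative cost {cost}")
```

Under this assumption, eight single-issue cores provide the same number of issue slots as one 8-issue core at one eighth of the issue-logic cost, provided the workload has enough threads to keep them busy.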

T2: Refining T1
90 nm, 1.2 GHz T1 -> 65 nm T2, 1.4+ GHz
4 threads / 6-stage pipeline per core -> 8 threads / 8- or 12-stage pipeline / 2 execution units (thread groups) per core
Memory controllers (4), 400 MHz 128-bit memory bus, DDR2 SDRAM (8 cycles) -> FB-DIMMs
JBUS -> 8x PCI Express, two 10 Gigabit Ethernet ports with 200 KB packet classification and filtering (layers 2 through 4)
Per-core cryptographic coprocessor with MAU and hash and cipher sub-engines
L2 size increased to 4 MB (8 banks, 16-way associative)
Shared FPU (<1% of instructions) -> per-core FPU

Fully Buffered DIMM
The JEDEC standard introduces an Advanced Memory Buffer (AMB) between the memory controller and the memory module. Unlike the parallel bus architecture of traditional DRAMs, an FB-DIMM has a serial interface between the memory controller and the AMB.

Reliability, Availability, Serviceability
Registers and L2 with SEC/DED; L1 I- and D-caches and TLBs with parity protection
Main memory: correct any error within a single nibble, detect any uncorrectable errors within any 2 nibbles (Galois field code, for higher bandwidth than Hamming)
DRAM sparing (degraded protection until service)
Memory scrubber (64 B/line, programmable)
Hardware retry on cache error
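As an illustration of what SEC/DED (single error correction, double error detection) means, the following toy example protects a 4-bit value with an extended Hamming (8,4) code. It is a teaching sketch only: the real registers and L2 protect much wider words, and the function names and bit layout here are assumptions, not the T1 implementation.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit; indices 1..7 hold a Hamming(7,4)
    code with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of ones even
    return code


def decode(code):
    """Return (status, nibble): 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1  # overall parity violated?
    if parity_odd:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips but a non-zero syndrome: double error.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                     # inject a single-bit upset
    print(decode(word))
    word[2] ^= 1                     # inject a second upset
    print(decode(word))
```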

Why this can be interesting for us
Results of FPGA mapping: preliminary results of mapping to Xilinx XC4VLX200 and Altera EP2S180 parts are listed below.

Xilinx XC4VLX200    SPARC     FPU      CCX
LUTs                134973    13863    25090
Utilization         75%       7%       14%

Altera EP2S180      SPARC     FPU      CCX
Estimated ALMs      74280     6855     19589
Utilization         103%      10%      27%
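The utilization figures above can be checked quickly for what fits on one device. This is a naive sketch: it only uses the percentages quoted on the slide, treats them as additive, and ignores routing, timing and I/O constraints; the names are hypothetical.

```python
# Utilization percentages as quoted on the slide (preliminary results).
UTILIZATION = {
    "Xilinx XC4VLX200": {"SPARC": 75, "FPU": 7, "CCX": 14},
    "Altera EP2S180": {"SPARC": 103, "FPU": 10, "CCX": 27},
}


def blocks_that_fit(device):
    """Blocks that individually need no more than 100% of the device."""
    return [name for name, pct in UTILIZATION[device].items() if pct <= 100]


def total_utilization(device):
    """Sum of the three blocks, assuming the percentages simply add up."""
    return sum(UTILIZATION[device].values())


if __name__ == "__main__":
    for device in UTILIZATION:
        print(device,
              "| fits individually:", blocks_that_fit(device),
              "| all three together:", total_utilization(device), "%")
```

On these figures one SPARC core, the FPU and the crossbar add up to just under a full XC4VLX200, while the SPARC core alone exceeds the estimated capacity of the EP2S180.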

Some Areas to Consider
Active, local management of power, temperature (reliability), soft errors
Special function integration: networking, cryptography
Use of SerDes for Ethernet, PCI Express, FB-DIMM
T1: software-defined, not instruction-defined, architecture
Multi-threaded (Sybase), multi-process apps (Oracle, SAP, PeopleSoft)
Bare-bones threaded (Java-based, e.g. JVM)
Multi-instance applications
Allocate an entire core (with performance impact) to real-time threads
Design independent or cache-positive interactive threads
Concurrency separates independent control flows and takes advantage of latency: applications need to be concurrent
Improving usage and utilisation: the addition of transactional hardware to CMPs (transactional memory soon commercially available) makes concurrent programming models easier

Learn from code that works, learn from the real chip
Explore the code that works. Learn techniques that have been tested to create low-power, highly productive chips.
Examples of some of the research areas being explored with OpenSPARC technology include:
  Performance analysis and modeling
  EDA development and optimization for multicore chips
  Soft cores for multicore FPGA implementations
  Optimizing and developing software for multicore systems
  EDA interest areas such as simulation/acceleration, formal verification, place-and-route algorithms, testability, and modeling in SystemVerilog and SystemC
  Developing multi-threaded EDA tools

What is going on while we discuss
Simply RISC has shipped the S1 Core, a 64-bit Wishbone-compliant single CPU core based on the OpenSPARC T1.
The S1 Core is released under the same license as the T1, the GNU General Public License (GPL); the design is freely downloadable from the Simply RISC website at www.srisc.com