Portland State University ECE 588/688. Cray-1 and Cray T3E


Portland State University ECE 588/688
Cray-1 and Cray T3E
Copyright by Alaa Alameldeen 2014

Cray-1
- A successful vector processor from the 1970s
- Vector instructions are an example of SIMD
- Contains both vector and scalar functional units
- At the time, it was also the world's fastest scalar processor (recall Amdahl's law)
- Can have up to 1 million words of memory (64-bit words)
- Weighs 10,500 pounds and consumes 115 kilowatts of power
- Physical dimensions in paper figures 1, 2, 3, 4

Cray-1 Architecture
- Has both scalar and vector processing modes
- 12.5 ns clock (80 MHz)
- Word size: 64 bits
- Twelve functional units
- Register types and counts (paper figure 5, sketched in code below):
  - 24-bit address (A) registers: 8
  - 24-bit intermediate address (B) registers: 64
  - 64-bit scalar (S) registers: 8
  - 64-bit intermediate scalar (T) registers: 64
  - 64-element vector (V) registers, each element 64 bits: 8
  - Vector length and vector mask registers
  - 64-bit real-time clock (RT) register: 1
- 4 instruction buffers, each holding 64 parcels (16 bits per parcel)
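
As a compact summary of that register complement, here is a minimal C sketch; the struct layout and field names are mine (only the A/B/S/T/V/RT letters come from the Cray nomenclature), and 24-bit registers are simply held in 32-bit fields:

```c
#include <stdint.h>

/* Illustrative model of the Cray-1 register set listed above. */
typedef struct {
    uint32_t A[8];      /* 24-bit address registers */
    uint32_t B[64];     /* 24-bit intermediate address registers */
    uint64_t S[8];      /* 64-bit scalar registers */
    uint64_t T[64];     /* 64-bit intermediate scalar registers */
    uint64_t V[8][64];  /* 8 vector registers, 64 elements of 64 bits */
    uint8_t  VL;        /* vector length register (0..64) */
    uint64_t VM;        /* vector mask register, one bit per element */
    uint64_t RT;        /* 64-bit real-time clock */
} Cray1Regs;
```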

Cray-1 Memory and I/O
- 1M words (2^20), each word containing 64 bits + 8 check bits
- 16 independent memory banks, each 64K words
- Bank cycle time: 4 clock periods (20 MHz)
- Bandwidth:
  - Transfer 1 word per cycle for B, T, and V registers
  - Transfer 1 word every 2 cycles for A and S registers
  - Transfer 4 words per cycle to instruction buffers
- Cray-1 doesn't have caches (why?)
- I/O:
  - Four 6-channel groups of I/O channels
  - Each channel group is served by memory every 4 cycles
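
As a back-of-the-envelope check (my arithmetic, assuming ideal interleaving across all 16 banks), the quoted 4-words-per-cycle instruction-buffer rate is exactly the banks' aggregate rate:

$$
\text{peak bandwidth} = \frac{16\ \text{banks}}{4\ \text{cycles/access}} \times 8\ \frac{\text{B}}{\text{word}} \times 80 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} = 2.56\ \text{GB/s}
$$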

Cray-1 Implementation Details
- Instruction formats (paper Table II)
- Register types and supporting registers
- A vector operation can have the following sources:
  - Two vector register operands
  - One vector register operand and one scalar register operand
- Parallel vector operations can be processed in two ways:
  - Using different functional units and V registers
  - Chaining: using the result stream to one vector register simultaneously as the operand set for another operation in a different functional unit (see the timing sketch below)
    - Avoids the overhead of storing intermediate results
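
The payoff of chaining is easiest to see with first-order timing. The sketch below models V2 = V0 * V1 followed by V3 = V2 + S0 for 64 elements; the 7-cycle multiply and 6-cycle add latencies are illustrative assumptions of mine, not figures from the slide or the paper:

```c
#include <stdio.h>

int main(void) {
    const int n = 64, lat_mul = 7, lat_add = 6;

    /* Without chaining: the add cannot start until the entire
       multiply result vector has been written to V2. */
    int unchained = (lat_mul + n) + (lat_add + n);

    /* With chaining: each product streams into the adder as soon
       as it emerges, so the two unit latencies add only once. */
    int chained = lat_mul + lat_add + n;

    printf("unchained: %d cycles, chained: %d cycles\n",
           unchained, chained);   /* 141 vs. 77 */
    return 0;
}
```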

Cray T3E Multiprocessor
- Implements a logical shared address space over a distributed memory architecture
- Each processing element contains (paper figure 1):
  - DEC Alpha 21164 processor
    - 8KB L1 I-cache, 8KB L1 D-cache
    - 96KB 3-way L2 cache
    - Allows two outstanding 64-byte cache line fills
  - Control chip
  - Router
  - 64 MB to 2 GB of memory
- T3E has up to 2K processors connected by a 3D torus

Cray T3E E-Registers
- Memory interface is augmented with external (E) registers
  - 512 user + 128 system registers
  - Explicitly managed
- All remote synchronization and communication is done between E-registers and memory
- E-registers extend a processor's physical address space to cover the full machine's physical memory
- E-register operations:
  - Direct loads and stores between E-registers and processor registers
  - Global E-register operations
    - Transfer data to/from remote or local memory
    - Perform messaging and atomic-operation synchronization
- The 21164 has a cacheable memory space and a non-cacheable I/O space
  - The most significant bit of the 40-bit address distinguishes the two (sketched below)
  - I/O space is used to access memory-mapped registers, including the E-registers
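
That last point is simple enough to state in code. This sketch only encodes what the slide says, that bit 39 (the MSB of a 40-bit physical address) selects I/O space, where memory-mapped registers such as the E-registers live:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a 40-bit 21164 physical address falls in the
   non-cacheable I/O space rather than cacheable memory space. */
static inline bool is_io_space(uint64_t paddr) {
    return (paddr >> 39) & 1;   /* bit 39 = MSB of the 40-bit address */
}
```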

Cray T3E Global Communication
- Global virtual address (GVA) components: paper figure 2
- Address translation for global references: paper figure 4
- Global operations on E-registers:
  - Gets: read memory into an E-register
  - Puts: write an E-register to memory
  - Both can operate on a single word (32-bit or 64-bit) or a vector (8 words) with arbitrary stride (see the toy model below)
- Gets and Puts can be highly pipelined due to the large number of E-registers
- Maximum transfer rate between two nodes using vector Gets or Puts is 480 MB/s
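
The following toy software model illustrates the Get/Put semantics described above. The function names, the flat `gmem` array standing in for globally addressable memory, and the word-granularity addressing are all my assumptions, not the T3E interface:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_EREGS 512                     /* user E-registers */
static uint64_t ereg[NUM_EREGS];
static uint64_t gmem[1024];               /* stand-in for global memory */

/* Get: read one word of (possibly remote) memory into an E-register. */
static void e_get(int e, size_t gva) { ereg[e] = gmem[gva]; }
/* Put: write one E-register out to (possibly remote) memory. */
static void e_put(int e, size_t gva) { gmem[gva] = ereg[e]; }

/* Vector form: 8 words with an arbitrary stride, as in the slide.
   Consecutive E-registers receive the strided words, which is what
   lets many such operations be pipelined to hide remote latency. */
static void e_get_vec(int e, size_t gva, size_t stride) {
    for (int i = 0; i < 8; i++) e_get(e + i, gva + (size_t)i * stride);
}

int main(void) {
    for (size_t i = 0; i < 1024; i++) gmem[i] = i;
    e_get_vec(0, 4, 3);                   /* words 4, 7, 10, ..., 25 */
    printf("ereg[2] = %llu\n", (unsigned long long)ereg[2]); /* 10 */
    e_put(2, 0);                          /* write it back to word 0 */
    return 0;
}
```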

Cray T3E Synchronization
- Atomic memory operations: paper Table 1
- Barrier and eureka synchronization:
  - Barriers allow a set of participating processors to determine when all processors have signaled some event (e.g., reached a certain point in program execution)
  - Eurekas allow a set of processors to determine when any one processor has signaled an event (e.g., completion of a parallel search)
- T3E has 32 barrier/eureka synchronization units (BSUs) at each processor
  - Accessed as memory-mapped registers
- States and events: paper Tables 2 & 3
- State transition diagrams: paper figure 6
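
To make the barrier-versus-eureka distinction concrete, here is a software analogue using C11 atomics. The real BSUs are dedicated hardware reached through memory-mapped registers; this sketch captures only the semantics (and the barrier shown is single-use):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Barrier: "all have signaled" -- wait until every participant arrives. */
typedef struct { atomic_int arrived; int participants; } barrier_t;

void barrier_wait(barrier_t *b) {
    atomic_fetch_add(&b->arrived, 1);
    while (atomic_load(&b->arrived) < b->participants)
        ;  /* spin until all participants have signaled */
}

/* Eureka: "any one has signaled" -- one signals, the rest observe. */
typedef struct { atomic_bool found; } eureka_t;

void eureka_signal(eureka_t *e) { atomic_store(&e->found, true); }
bool eureka_poll(const eureka_t *e) { return atomic_load(&e->found); }
```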

Portland State University ECE 588/688
IBM Power4 System Microarchitecture
Copyright by Alaa Alameldeen 2014

IBM Power4 Design Principles
- SMP optimization
  - Designed for high-throughput multi-tasking environments
- Full system design approach
  - Whole system designed together; processor designed with the full system in mind
- High frequency design
  - Important for single-threaded applications
- RAS: Reliability, Availability, and Serviceability
- Balanced scientific vs. commercial performance
  - Good performance for both high-performance scientific computing applications and commercial server applications
- Binary compatibility with previous IBM processors

Power4 Chip Features
- Two processors on a chip (figure 1; die photo in figure 2)
- Each processor has private L1 caches
- Both processors share an on-chip L2 cache through a core interface unit (CIU)
  - Crossbar between the two processors' L1 I- and D-caches and three L2 controllers
  - Each L2 controller can feed 32B per cycle
  - Accepts 8B processor stores to the L2 controllers
- Each processor has a noncacheable unit (NC)
  - Logically part of L2; handles noncacheable operations
- L3 directory and L3 controller are on chip
  - The actual L3 cache is on a separate chip
- Fabric controller controls data flow between the L2 and L3 controllers

Power4 Processor Features
- The two identical on-chip processors provide two-way SMP to software (an example of chip multiprocessing)
- Each processor is a superscalar out-of-order processor
  - Issue width: up to 8; retire width: 5
- 8 execution units, each capable of issuing one instruction per cycle:
  - Two floating-point execution units, each able to start an FP add and an FP multiply every cycle
  - Two load/store units, each able to perform address-generation arithmetic
  - Two fixed-point execution units
  - Branch execution unit
  - Condition register logical execution unit
- Core block diagram: paper figure 3

Power4 Microarchitecture
- Complex branch prediction
  - Branch target and direction prediction
  - A selector table chooses between a local branch history table and a predictor indexed by a global history vector (sketched below)
  - Selective pipeline flush on branch misprediction
- Instructions are decoded, cracked into internal instructions (IOPs), then grouped into five-instruction groups
  - The fifth IOP is always a branch
  - Groups are dispatched in order; IOPs within a group issue out of order
  - A whole group commits together (up to 5 IOPs)
- Issue queues: paper table 1; rename resources: table 2
- Pipeline in paper figure 4
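
The selector scheme is a classic tournament structure, sketched below. Table sizes, the XOR indexing, and the use of 2-bit saturating counters are illustrative assumptions on my part; the Power4's actual tables and counter widths differ (see the paper):

```c
#include <stdbool.h>
#include <stdint.h>

#define TBL 4096
static uint8_t local_ctr[TBL];    /* local (per-branch) predictions */
static uint8_t global_ctr[TBL];   /* predictions indexed with global history */
static uint8_t selector[TBL];     /* 2-bit: prefer local vs. global */
static uint16_t ghv;              /* global history vector */

static bool predict(uint64_t pc) {
    bool loc = local_ctr[pc % TBL] >= 2;
    bool glo = global_ctr[(pc ^ ghv) % TBL] >= 2;
    return (selector[pc % TBL] >= 2) ? glo : loc;
}

static void update(uint64_t pc, bool taken) {
    uint8_t *l = &local_ctr[pc % TBL], *g = &global_ctr[(pc ^ ghv) % TBL];
    uint8_t *s = &selector[pc % TBL];
    bool loc = *l >= 2, glo = *g >= 2;
    /* Train the selector toward whichever component was right. */
    if (glo == taken && loc != taken && *s < 3) (*s)++;
    if (loc == taken && glo != taken && *s > 0) (*s)--;
    /* Train both components as saturating counters. */
    if (taken) { if (*l < 3) (*l)++; if (*g < 3) (*g)++; }
    else       { if (*l > 0) (*l)--; if (*g > 0) (*g)--; }
    ghv = (uint16_t)((ghv << 1) | (taken ? 1 : 0));
}
```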

Load/Store Unit Operation
- Main structures:
  - Load Reorder Queue (LRQ), i.e., load buffer
  - Store Reorder Queue (SRQ), i.e., store address buffer
  - Store Data Queue (SDQ)
- Hazards avoided by the load/store unit:
  - Load hit store (RAW1): a younger load executes before an older store writes its data to memory. The load should get its data from the SDQ. Possible flush or reissue.
  - Store hit load (RAW2): a younger load executes before recognizing that an older store will write to the same location. The store checks the LRQ and flushes all subsequent groups on a hit (see the sketch below).
  - Load hit load (RAR): if a younger load got old data, the older load must not get new data. The older load checks the snooping bit in the LRQ for younger loads and flushes all subsequent groups on a hit.
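
Here is a minimal sketch of the store-hit-load (RAW2) check: an executing store searches the LRQ for any younger load to the same address that has already executed, and reports the oldest such load so everything from it onward can be flushed. The structure layout is mine, not the Power4's (ages are assumed nonzero, smaller = older):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;       /* load address */
    uint64_t age;        /* program-order tag; smaller = older, never 0 */
    bool     executed;   /* has this load already produced a value? */
} lrq_entry_t;

/* Returns the age of the oldest violating load, or 0 if none. */
uint64_t store_hit_load(const lrq_entry_t *lrq, int n,
                        uint64_t st_addr, uint64_t st_age) {
    uint64_t victim = 0;
    for (int i = 0; i < n; i++) {
        if (lrq[i].executed && lrq[i].age > st_age &&   /* younger, done */
            lrq[i].addr == st_addr &&                   /* same location */
            (victim == 0 || lrq[i].age < victim))
            victim = lrq[i].age;     /* flush from this load onward */
    }
    return victim;
}
```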

Memory Hierarchy
- Memory hierarchy details in paper table 3
- L2 logical view in paper figure 5
- L3 logical view in paper figure 6
- Memory subsystem logical view in paper figure 7
- Hardware prefetching (sketched below):
  - Eight sequential stream prefetchers per processor
  - Prefetch data to L1 from L2, to L2 from L3, and to L3 from memory
  - Streams are initiated when the processor misses on sequential cache accesses
  - L3 prefetches 512B lines
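
A sequential stream prefetcher in that spirit can be sketched in a few lines. The line size, single-line run-ahead, and round-robin allocation below are illustrative assumptions, not the Power4's actual policy:

```c
#include <stdint.h>

#define STREAMS 8
#define LINE    128
typedef struct { uint64_t next_line; int live; } stream_t;
static stream_t streams[STREAMS];

static void prefetch(uint64_t addr) { (void)addr; /* issue to the cache */ }

void on_miss(uint64_t addr) {
    uint64_t line = addr / LINE;
    /* A miss that matches a stream's prediction confirms the stream:
       advance it and prefetch the next sequential line. */
    for (int i = 0; i < STREAMS; i++) {
        if (streams[i].live && streams[i].next_line == line) {
            prefetch((line + 1) * LINE);
            streams[i].next_line = line + 1;
            return;
        }
    }
    /* No match: allocate a stream (round-robin) predicting the next line. */
    static int victim = 0;
    streams[victim].next_line = line + 1;
    streams[victim].live = 1;
    victim = (victim + 1) % STREAMS;
}
```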

Cache Coherence
- Each L2 controller has four coherency processors to handle requests from either processor's caches or store queues
  - Controls return of data from L2 (hit) or from the fabric controller (miss) to the requesting processor
  - Updates L2 directory state
  - Issues commands to the fabric on L2 misses
  - Controls writing to L2
  - Initiates invalidates to a processor if a processor's store hits a cache line marked as resident in another processor's L1
- Each L2 controller also has four snoop processors to handle coherency operations from the fabric
  - Can source data from this L2 to another L2

Coherence Protocol
- L2 uses an enhanced version of MESI (paper table 4; see the sketch below):
  - I: Invalid
  - SL: Shared, can be sourced to local requesters
    - Entered when a processor load or I-fetch misses L2 and data is sourced from another L2 or from memory
  - S: Shared, cannot be sourced
    - Entered when another processor snoops a cache line in SL state
  - M: Modified, can be sourced
    - Entered on a processor store
  - Me: Exclusive
  - Mu: Unsolicited modified
    - Entered when data is sourced from another L2 in M state
  - T: Tagged (valid, modified, already sourced to another L2)
    - Entered on a snoop read from M state
- L3 has a simpler protocol (see paper)
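
The states and the specific transitions the slide spells out can be written down directly; this is a partial sketch only, since the full transition table is in the paper's table 4:

```c
/* Enhanced-MESI L2 states from the slide. */
typedef enum { I, SL, S, M, Me, Mu, T } l2_state_t;

/* Load or I-fetch miss; data sourced from another L2 or from memory. */
l2_state_t on_load_miss(void)          { return SL; }
/* Another processor snoops a line this cache holds in SL. */
l2_state_t on_snoop_of_SL(void)        { return S;  }
/* Processor store. */
l2_state_t on_store(void)              { return M;  }
/* Fill arrives sourced from another L2 that held the line in M. */
l2_state_t on_fill_from_remote_M(void) { return Mu; }
/* Snoop read hits a line this cache holds in M. */
l2_state_t on_snoop_read_of_M(void)    { return T;  }
```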

Connecting into Larger SMPs
- The basic building block is the Multi-Chip Module (MCM)
  - Four Power4 chips form an 8-way SMP (paper figure 9)
  - Each chip writes to its own bus (with arbitration among the L2, I/O controller, and L3 controller)
  - Each of the four chips snoops all buses
- 1-4 MCMs can be connected to form 8-way, 16-way, 24-way, and 32-way SMPs
  - 32-way SMP shown in paper figure 10
  - Inter-module buses act as repeaters, moving requests and responses from one module to another in a ring topology
  - Each chip writes to its own bus but snoops all buses

Reading Assignment
- No class Tuesday
- Thursday:
  - Erik Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008 (Review)
- Tuesday 11/18:
  - John Mellor-Crummey and Michael Scott, "Synchronization Without Contention," ACM Transactions on Computer Systems, 1991 (Review)
  - Thomas Anderson, "The Performance of Spin-Lock Alternatives," IEEE Transactions on Parallel and Distributed Systems, 1990 (Skim)
  - Ravi Rajwar and James Goodman, "Speculative Lock Elision: Enabling Highly-Concurrent Multithreaded Execution," MICRO 2001 (Skim)
- Project progress report due 11/18