Portland State University ECE 588/688
Cray-1 and Cray T3E
Copyright by Alaa Alameldeen 2014

Cray-1
- A successful vector processor from the 1970s
- Vector instructions are examples of SIMD
- Contains vector and scalar functional units
- At the time, was also the world's fastest scalar processor (recall Amdahl's law)
- Can have up to 1 million words of memory (64-bit words)
- Weighs 10,500 pounds, consumes 115 kilowatts of power
- Physical dimensions in paper figures 1, 2, 3, 4

Portland State University ECE 588/688, Fall 2014

Cray-1 Architecture
- Has both scalar and vector processing modes
- 12.5 ns clock (80 MHz)
- Word size: 64 bits
- Twelve functional units
- Register types and counts (paper figure 5):
  - 24-bit address (A) registers: 8
  - 24-bit intermediate address (B) registers: 64
  - 64-bit scalar (S) registers: 8
  - 64-bit intermediate scalar (T) registers: 64
  - 64-element vector (V) registers (each element 64 bits): 8
  - Vector length and vector mask registers
  - 64-bit real-time clock (RT) register: 1
- 4 instruction buffers, each holding 64 parcels (16 bits per parcel)

Cray-1 Memory and I/O
- 1M words (2^20), each word containing 64 bits + 8 check bits
- 16 independent memory banks, each 64K words
- 4 clock period bank cycle time (20 MHz)
- Bandwidth:
  - Transfer 1 word per cycle for B, T, and V registers
  - Transfer 1 word every 2 cycles for A and S registers
  - Transfer 4 words per cycle to instruction buffers
- Cray-1 doesn't have caches (why?)
- I/O:
  - Four 6-channel groups of I/O channels
  - Each channel group is served by memory every 4 cycles

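The interleaved banks above are what let the machine sustain one word per cycle, but only when consecutive accesses land in different banks. A minimal sketch of that effect, using the bank count and 4-clock bank cycle time from the slide (the conflict model itself is a simplification, not the real Cray-1 timing):

```python
# Toy model of Cray-1 style low-order memory interleaving.
NUM_BANKS = 16
BANK_BUSY = 4  # clock periods a bank stays busy after an access

def access_time(addresses):
    """Total clocks to issue one access per word address, stalling
    whenever the target bank is still busy from an earlier access."""
    ready = [0] * NUM_BANKS  # clock at which each bank becomes free
    clock = 0
    for a in addresses:
        bank = a % NUM_BANKS          # low-order interleaving
        clock = max(clock, ready[bank])  # stall until the bank is free
        ready[bank] = clock + BANK_BUSY
        clock += 1                    # one issue slot per clock
    return clock

# Stride-1 spreads accesses across all 16 banks: one word per clock.
unit_stride = access_time(range(64))
# Stride-16 hits the same bank every time: one word per 4 clocks.
bank_conflict = access_time(range(0, 64 * 16, 16))
```

With 64 accesses, the unit-stride stream finishes in 64 clocks while the stride-16 stream serializes on a single bank's 4-clock cycle time.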
Cray-1 Implementation Details
- Instruction formats (paper Table II)
- Register types and supporting registers
- A vector operation can have the following sources:
  - Two vector register operands
  - One vector register operand and one scalar register operand
- Parallel vector operations can be processed in two ways:
  - Using different functional units and V registers
  - Chaining: using the result stream to one vector register simultaneously as the operand set for another operation in a different functional unit
    - Avoids the overhead of storing intermediate results

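The benefit of chaining can be seen with a back-of-envelope timing model for two dependent 64-element vector operations. The unit latencies below are illustrative, not the real Cray-1 functional unit times:

```python
VLEN = 64  # elements per vector register

def unchained(lat1, lat2):
    # The second operation waits for the entire first result vector.
    return (lat1 + VLEN) + (lat2 + VLEN)

def chained(lat1, lat2):
    # The second unit consumes each element as it streams out of the
    # first, so only the two pipeline fill times are serialized.
    return lat1 + lat2 + VLEN

t_unchained = unchained(6, 7)  # e.g. a multiply feeding an add
t_chained = chained(6, 7)
```

Chaining roughly halves the time for this dependent pair because the 64-element drain of the first operation overlaps the second.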
Cray T3E Multiprocessor
- Implements a logically shared address space over a distributed memory architecture
- Each processing element contains (paper figure 1):
  - DEC Alpha 21164 processor
    - 8KB L1 I-cache, 8KB L1 D-cache
    - 96KB 3-way L2 cache
    - Allows two outstanding 64-byte cache line fills
  - Control chip
  - Router
  - 64 MB to 2 GB of memory
- T3E has up to 2K processors connected by a 3D torus

Cray T3E E-Registers
- Memory interface is augmented with external (E) registers
  - 512 user + 128 system registers
  - Explicitly managed
- All remote synchronization and communication is done between E-registers and memory
- E-registers extend the physical address space of a processor to cover the full machine's physical memory
- E-register operations:
  - Direct loads and stores between E-registers and processor registers
  - Global E-register operations:
    - Transfer data to/from remote or local memory
    - Perform messaging and atomic-operation synchronization
- The 21164 has a cacheable memory space and a non-cacheable I/O space
  - The most significant bit of the 40-bit address distinguishes the two
  - I/O space is used to access memory-mapped registers, including E-registers

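The memory/I-O split described above amounts to testing one address bit. A minimal sketch, assuming bit 39 (the MSB of the 40-bit physical address) selects the non-cacheable I/O space:

```python
IO_BIT = 1 << 39  # MSB of the 40-bit physical address

def is_io_space(addr):
    """True if the address falls in the non-cacheable I/O space
    (where memory-mapped registers such as E-registers live)."""
    assert addr < (1 << 40), "address exceeds 40 bits"
    return bool(addr & IO_BIT)

mem_ref = is_io_space(0x12345678)        # cacheable memory space
ereg_ref = is_io_space(IO_BIT | 0x1000)  # a memory-mapped register
```

The specific register offsets here are illustrative; only the MSB test comes from the slide.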
Cray T3E Global Communication
- Global virtual address (GVA) components: paper figure 2
- Address translation for global references: paper figure 4
- Global operations on E-registers:
  - Gets: read memory into an E-register
  - Puts: write an E-register to memory
  - Both can operate on a single word (32-bit or 64-bit) or a vector (8 words) with arbitrary stride
- Gets and Puts can be highly pipelined due to the large number of E-registers
- Maximum transfer rate between two nodes using vector Gets or Puts is 480 MB/s

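A vector Get as described above reads 8 words at an arbitrary stride into consecutive E-registers. A minimal sketch of the semantics; the function name, the flat `memory` list, and the argument layout are illustrative, not the real T3E interface:

```python
VEC_LEN = 8  # words per vector Get/Put

def vector_get(memory, e_regs, dest, base, stride):
    """Copy memory[base], memory[base+stride], ... into
    e_regs[dest] .. e_regs[dest + 7]."""
    for i in range(VEC_LEN):
        e_regs[dest + i] = memory[base + i * stride]

memory = list(range(100))   # stand-in for (remote) memory words
e_regs = [0] * 512          # the 512 user E-registers
vector_get(memory, e_regs, dest=0, base=4, stride=3)
```

Because each of the 8 reads is independent, many such Gets can be in flight at once, which is what makes the heavy pipelining on the slide possible.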
Cray T3E Synchronization
- Atomic memory operations: paper Table 1
- Barrier and eureka synchronization:
  - Barriers allow a set of participating processors to determine when all processors have signaled some event (e.g., reached a certain point in program execution)
  - Eurekas allow a set of processors to determine when any one processor has signaled an event (e.g., completion of a parallel search)
- T3E has 32 barrier/eureka synchronization units (BSUs) at each processor
  - Accessed as memory-mapped registers
- States and events: paper Tables 2 & 3
- State transition diagrams: paper figure 6

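The "all have signaled" versus "any one has signaled" distinction above can be captured as two tiny state machines. This is a semantic sketch only; the real BSUs are memory-mapped hardware with the richer state/event tables cited from the paper:

```python
class Barrier:
    """Satisfied only when all n participants have signaled."""
    def __init__(self, n):
        self.n, self.arrived = n, set()
    def signal(self, pe):
        self.arrived.add(pe)
    def satisfied(self):
        return len(self.arrived) == self.n

class Eureka:
    """Satisfied as soon as any one participant signals."""
    def __init__(self):
        self.found = False
    def signal(self, pe):
        self.found = True
    def satisfied(self):
        return self.found

b, e = Barrier(4), Eureka()
b.signal(0); b.signal(1); e.signal(1)
partial = (b.satisfied(), e.satisfied())  # barrier not yet, eureka done
b.signal(2); b.signal(3)
barrier_done = b.satisfied()
```

A eureka fits a parallel search naturally: the first processor to find the answer releases everyone else.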
Portland State University ECE 588/688
IBM Power4 System Microarchitecture
Copyright by Alaa Alameldeen 2014

IBM Power4 Design Principles
- SMP optimization
  - Designed for high-throughput multi-tasking environments
- Full system design approach
  - Whole system designed together; processor designed with the full system in mind
- High-frequency design
  - Important for single-threaded applications
- RAS: Reliability, Availability, and Serviceability
- Balanced scientific vs. commercial performance
  - Good performance for both high-performance scientific computing applications and commercial server applications
- Binary compatibility with previous IBM processors

Power4 Chip Features
- Two processors on a chip (figure 1; die photo in figure 2)
- Each processor has private L1 caches
- Both processors share an on-chip L2 cache through a core interface unit (CIU)
  - Crossbar between the two processors' L1 I- and D-caches and three L2 controllers
  - Each L2 controller can feed 32B per cycle
  - Accepts 8B processor stores to L2 controllers
- Each processor has a noncacheable unit (NC)
  - Logically part of L2; handles noncacheable operations
- L3 directory and L3 controller are on chip
  - Actual L3 cache is on a separate chip
- Fabric controller controls data flow between the L2 and L3 controllers

Power4 Processor Features
- On chip, two identical processors provide two-way SMP to software (an example of chip multiprocessing)
- Each processor is a superscalar out-of-order processor
  - Issue width: up to 8; retire width: 5
  - 8 execution units, each capable of issuing one instruction/cycle:
    - Two floating-point execution units, each able to start an FP add and an FP multiply every cycle
    - Two load/store units, each able to perform address-generation arithmetic
    - Two fixed-point execution units
    - Branch execution unit
    - Condition-register logical execution unit
- Core block diagram: paper figure 3

Power4 Microarchitecture
- Complex branch prediction
  - Branch target and direction prediction
  - A selector table chooses between a local branch history table and a global history vector
  - Selective pipeline flush on branch misprediction
- Instructions are decoded, cracked into internal instructions (IOPs), then grouped into five-instruction groups
  - Fifth IOP slot is always a branch
  - Groups are dispatched in order; IOPs within a group issue out of order
  - A whole group commits together (up to 5 IOPs)
- Issue queues: paper table 1; rename resources: paper table 2
- Pipeline: paper figure 4

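The group-formation step above can be sketched as packing IOPs into five-slot groups where a branch always lands in the last slot and terminates its group. This ignores many real constraints (cracked-instruction pairing, resource limits), and the nop-padding policy is an assumption for illustration:

```python
GROUP_SIZE = 5

def form_groups(iops):
    """iops: list of (name, is_branch) tuples, in program order.
    Returns a list of 5-slot groups; branches occupy the last slot."""
    groups, cur = [], []
    for op, is_branch in iops:
        if is_branch:
            while len(cur) < GROUP_SIZE - 1:  # pad so the branch is slot 5
                cur.append("nop")
            cur.append(op)
            groups.append(cur)
            cur = []
        else:
            cur.append(op)
            if len(cur) == GROUP_SIZE - 1:    # keep slot 5 for a branch
                cur.append("nop")
                groups.append(cur)
                cur = []
    if cur:                                   # flush a partial final group
        while len(cur) < GROUP_SIZE:
            cur.append("nop")
        groups.append(cur)
    return groups

groups = form_groups([("add", False), ("load", False), ("bc", True),
                      ("mul", False)])
```

Since the whole group commits together, a flush after a misprediction only needs group-granularity bookkeeping rather than per-IOP state.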
Load/Store Unit Operation
- Main structures:
  - Load Reorder Queue (LRQ), i.e., load buffer
  - Store Reorder Queue (SRQ), i.e., store address buffer
  - Store Data Queue (SDQ)
- Hazards avoided by the load/store unit:
  - Load hit store (RAW1): a younger load executes before an older store writes its data to memory; the load should get its data from the SDQ (possible flush or reissue)
  - Store hit load (RAW2): a younger load executes before recognizing that an older store will write to the same location; the store checks the LRQ and flushes all subsequent groups on a hit
  - Load hit load (RAR): if a younger load got old data, the older load must not get new data; the older load checks the snooping bit in the LRQ for younger loads and flushes all subsequent groups on a hit

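The load-hit-store case above boils down to a younger load searching the pending older stores for a matching address before touching memory. A minimal sketch of that forwarding check, with illustrative structures (the real SRQ/SDQ also handle partial overlaps, which can force the flush/reissue mentioned on the slide):

```python
def execute_load(addr, memory, pending_stores):
    """pending_stores: older (addr, data) entries, oldest first.
    Return the youngest matching store's data, else memory[addr]."""
    for st_addr, st_data in reversed(pending_stores):  # youngest first
        if st_addr == addr:
            return st_data       # forward from the store data queue
    return memory[addr]          # no conflict: read memory/cache

memory = {0x100: 1, 0x200: 2}
pending = [(0x100, 99), (0x300, 5)]        # older, not-yet-written stores
forwarded = execute_load(0x100, memory, pending)  # must see 99, not 1
from_mem = execute_load(0x200, memory, pending)   # no match: memory value
```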
Memory Hierarchy
- Memory hierarchy details: paper table 3
- L2 logical view: paper figure 5
- L3 logical view: paper figure 6
- Memory subsystem logical view: paper figure 7
- Hardware prefetching:
  - Eight sequential stream prefetchers per processor
  - Prefetch data to L1 from L2, to L2 from L3, and to L3 from memory
  - Streams are initiated when the processor misses on sequential cache accesses
  - L3 prefetches 512B lines

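The stream-initiation rule above (misses to sequential lines allocate a stream that then runs ahead of demand) can be sketched as follows. The line size, detection threshold, and prefetch depth are illustrative assumptions, not the Power4's actual parameters:

```python
LINE = 128  # bytes per cache line (illustrative)

class StreamPrefetcher:
    """Detects back-to-back misses to consecutive lines and then
    prefetches `depth` lines ahead of the miss."""
    def __init__(self, depth=2):
        self.last_miss_line = None
        self.depth = depth

    def on_miss(self, addr):
        """Return the list of line addresses to prefetch (may be empty)."""
        line = addr // LINE
        sequential = (self.last_miss_line is not None
                      and line == self.last_miss_line + 1)
        self.last_miss_line = line
        if not sequential:
            return []            # no stream established yet
        return [(line + i) * LINE for i in range(1, self.depth + 1)]

pf = StreamPrefetcher()
first = pf.on_miss(0)        # first miss: no stream yet
second = pf.on_miss(LINE)    # sequential miss: run ahead of demand
```

The real hardware keeps eight such streams per processor and stages the prefetches across L1/L2/L3 as listed above.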
Cache Coherence
- Each L2 controller has four coherency processors to handle requests from either processor's caches or store queues
  - Control the return of data from L2 (hit) or the fabric controller (miss) to the requesting processor
  - Update the L2 directory state
  - Issue commands to the fabric on L2 misses
  - Control writing to L2
  - Initiate invalidates to a processor if a processor's store hits a cache line marked as resident in the other processor's L1
- Each L2 controller has four snoop processors to handle coherency operations from the fabric
  - Can source data from this L2 to another L2

Coherence Protocol
- L2 uses an enhanced version of MESI (paper table 4):
  - I: Invalid
  - SL: Shared, can be sourced to local requesters
    - Entered when a processor load or I-fetch misses L2 and data is sourced from another L2 or from memory
  - S: Shared, cannot be sourced
    - Entered when another processor snoops a cache line in SL state
  - M: Modified, can be sourced
    - Entered on a processor store
  - Me: Exclusive
  - Mu: Unsolicited modified
    - Entered when data is sourced from another L2 in M state
  - T: Tagged (valid, modified, sourced to another L2)
    - Entered on a snoop read from M state
- L3 has a simpler protocol (see paper)

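A few of the transitions named above can be written down as a lookup table, which is often the easiest way to sanity-check a protocol description. This sketch covers only the transitions spelled out on the slide, with illustrative event names; it is nowhere near the paper's complete table 4:

```python
TRANSITIONS = {
    # (current state, event) -> next state
    ("I",  "load_miss_sourced_elsewhere"): "SL",  # filled from another L2/memory
    ("SL", "snoop_by_other"):              "S",   # loses the right to source
    ("I",  "store"):                       "M",   # store installs modified data
    ("M",  "snoop_read"):                  "T",   # modified, sourced to another L2
}

def next_state(state, event):
    """Apply one transition; unknown (state, event) pairs keep the state."""
    return TRANSITIONS.get((state, event), state)

after_fill = next_state("I", "load_miss_sourced_elsewhere")  # -> SL
after_snoop = next_state("M", "snoop_read")                  # -> T
```

The T state is the interesting addition over plain MESI: the line stays modified locally even after its data has been handed to another L2.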
Connecting into Larger SMPs
- Basic building block is the Multi-Chip Module (MCM)
  - Four Power4 chips form an 8-way SMP (paper figure 9)
  - Each chip writes to its own bus (with arbitration among the L2, I/O controller, and L3 controller)
  - Each of the four chips snoops all buses
- 1-4 MCMs can be connected to form 8-way, 16-way, 24-way, and 32-way SMPs
  - 32-way SMP shown in paper figure 10
  - Inter-module buses act as repeaters, moving requests and responses from one module to another in a ring topology
  - Each chip writes to its own bus but snoops all buses

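The ring-of-repeaters topology above implies every request eventually visits every module, and every chip snoops it. A purely illustrative sketch of the visit order around such a ring (module and chip counts from the 32-way configuration on the slide):

```python
NUM_MCMS = 4       # modules in the 32-way system
CHIPS_PER_MCM = 4  # Power4 chips per module

def ring_visit_order(origin_mcm):
    """Modules in the order a request visits them, starting at the
    originating module and repeated once around the ring."""
    return [(origin_mcm + hop) % NUM_MCMS for hop in range(NUM_MCMS)]

visit_order = ring_visit_order(2)            # request originating on MCM 2
snooping_chips = NUM_MCMS * CHIPS_PER_MCM    # every chip snoops every bus
```

Since each chip carries two processors, the 16 snooping chips correspond to the 32-way SMP of figure 10.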
Reading Assignment
- No class Tuesday
- Thursday: Erik Lindholm et al., "Nvidia Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008 (Review)
- Tuesday 11/18:
  - John Mellor-Crummey and Michael Scott, "Synchronization Without Contention," ACM Transactions on Computer Systems, 1991 (Review)
  - Thomas Anderson, "The Performance of Spin-Lock Alternatives," IEEE Transactions on Parallel and Distributed Systems, 1990 (Skim)
  - Ravi Rajwar and James Goodman, "Speculative Lock Elision: Enabling Highly-Concurrent Multithreaded Execution," MICRO 2001 (Skim)
- Project progress report due 11/18