Portland State University ECE 588/688. Cray-1 and Cray T3E

Similar documents
Portland State University ECE 588/688. Cray-1 and Cray T3E

Introduction to Parallel Computing

Portland State University ECE 588/688. Memory Consistency Models

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II

Portland State University ECE 588/688. Graphics Processors

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

ECE 587 Advanced Computer Architecture I

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 588/688. Dataflow Architectures

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

Portland State University ECE 587/687. Caches and Prefetching

ECE/CS 757: Advanced Computer Architecture II Massively Parallel Processors

Synchronization and Communication in the T3E Multiprocessor

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Uniprocessor Computer Architecture Example: Cray T3E

EE382 Processor Design. Concurrent Processors

Advanced Topics in Computer Architecture

Portland State University ECE 588/688. Cache Coherence Protocols

Advanced Computer Architecture

CMPE 655 Multiple Processor Systems. SIMD/Vector Machines. Daniel Terrance Stephen Charles Rajkumar Ramadoss

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

GPU Fundamentals Jeff Larkin November 14, 2016

1. Creates the illusion of an address space much larger than the physical memory

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

Digital Semiconductor Alpha Microprocessor Product Brief

Lecture 2 Parallel Programming Platforms

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Continuum Computer Architecture

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

Figure 1-1. A multilevel machine.

Advanced cache optimizations. ECE 154B Dmitri Strukov

Systems Design and Programming. Instructor: Chintan Patel

CS 838 Chip Multiprocessor Prefetching

Lec 13: Linking and Memory. Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University. Announcements

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli

Pipelined processors and Hazards

COSC 6385 Computer Architecture - Memory Hierarchies (III)

17 Vector Performance

EC 513 Computer Architecture

OpenMP on the IBM Cell BE

The T0 Vector Microprocessor. Talk Outline

The University of Texas at Austin

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University

Three basic multiprocessing issues

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

Lecture 7: Implementing Cache Coherence. Topics: implementation details

COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell. COMP4300/8300 Lecture 2-1 Copyright c 2015 The Australian National University

Jim Keller. Digital Equipment Corp. Hudson MA

NAME: Problem Points Score. 7 (bonus) 15. Total

CSEE W4824 Computer Architecture Fall 2012

CPE/EE 421 Microcomputers

ELE 375 Final Exam Fall, 2000 Prof. Martonosi

Portland State University ECE 588/688. Cache-Only Memory Architectures

ECE 485/585 Midterm Exam

PowerPC 740 and 750

The Alpha and Microprocessors:

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff

Case Study 1: Optimizing Cache Performance via Advanced Techniques

Lecture 6. Programming with Message Passing Message Passing Interface (MPI)

Preliminary, February Full 64-bit Alpha architecture: - Advanced RISC architecture - Optimized for high performance

INTELLIGENCE PLUS CHARACTER - THAT IS THE GOAL OF TRUE EDUCATION UNIT-I

MULTIPROCESSOR system has been used to improve

Portland State University ECE 587/687. Superscalar Issue Logic

Lecture 8: RISC & Parallel Computers. Parallel computers

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Snoop-Based Multiprocessor Design III: Case Studies

Scalable Distributed Memory Machines

CSE 262 Spring Scott B. Baden. Lecture 1 Introduction

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CS 153 Design of Operating Systems Winter 2016

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

Microarchitecture Overview. Performance

COSC 6385 Computer Architecture - Memory Hierarchies (II)

Advanced Computer Architecture

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

Advanced Computer Architecture

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell

SGI Challenge Overview

OpenMP on the IBM Cell BE

Lecture 17: Parallel Architectures and Future Computer Architectures. Shared-Memory Multiprocessors

Lecture 4: RISC Computers

Spring Prof. Hyesoon Kim

Portland State University ECE 588/688. Transactional Memory

Online Course Evaluation. What we will do in the last week?

EECS 322 Computer Architecture Superpipline and the Cache

Main Memory (Fig. 7.13) Main Memory

Lecture 2: CUDA Programming

Issues in Multiprocessors

Transcription:

Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2018

Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector and scalar functional units At the time, was the world s fastest scalar processor (recall Amdahl s law) Can have up to 1 million words of memory (64-bit words) 10,500 pounds, consumes 115 kilowatts of power Physical dimensions in paper figures 1, 2, 3, 4 Portland State University ECE 588/688 Winter 2018 2

Cray-1 Architecture Has both scalar and vector processing modes 12.5 ns clock (80 MHz) Word size: 64-bits Twelve functional units Register types and count (paper figure 5) 24-bit address (A) registers: 8 24-bit intermediate address (B) registers: 24 64-bit scalar (S) registers: 8 64-bit intermediate scalar (T) registers: 64 64-element vector (V) registers, each element is 64-bits: 8 Vector length and vector mask registers 64-bit real time clock (RT) register: 1 4 instruction buffers, each 64 parcels (16 bits per parcel) Portland State University ECE 588/688 Winter 2018 3

Cray-1 Memory and I/O 1 M words (2 20 ), each word containing 64-bits + 8 check bits 16 independent memory banks, each 64K words 4 clock period bank cycle time (20 MHz) Bandwidth: Transfer 1 word per cycle for B, T, and V registers Transfer 1 word every 2 cycles for A and S registers Transfer 4 words every cycle to instruction buffers Cray-1 doesn t have caches why? I/O Four 6-channel groups of I/O channels Each channel group served by memory every 4 cycles Portland State University ECE 588/688 Winter 2018 4

Cray-1 Implementation Details Instruction formats (paper Table II) Register types and supporting registers A vector operation can have the following sources Two vector register operands One vector register operand and one scalar register operand Parallel vector operations can be processed in two ways Using different functional units and V registers Chaining: Using the result stream to one vector register simultaneously as the operand set for another operation in a different functional unit Avoids overhead of storing intermediate results Portland State University ECE 588/688 Winter 2018 5

Cray T3E Multiprocessor Implements logical shared address space over a distributed memory architecture Each processing element contains (paper figure 1) DEC Alpha 21164 processor 8KB L1 I-cache, 8KB L1 D-cache 96 KB 3-way L2 cache Allows two outstanding 64-byte cache line fills Control chip Router 64 MB to 2 GB of memory T3E has up to 2K processors connected by a 3D torus Portland State University ECE 588/688 Winter 2018 6

Cray T3E E-Registers Memory interface is augmented with external (E) registers 512 user + 128 system registers Explicitly managed All remote synchronization and communication done between E registers and memory E registers extend the physical address space of a processor to cover full machine physical memory E-register operations: Direct loads and stores between E regs & processor regs Global E-register operations Transfer data to/from remote or local memory Perform messaging and atomic operation synchronization 21164 has cacheable memory space & non-cacheable I/O space Most significant bit of 40-bit address distinguishes two types I/O space used to access memory-mapped registers, including E registers Portland State University ECE 588/688 Winter 2018 7

Cray T3E Global Communication Global virtual address (GVA) components: paper figure 2 Address translation for global references: paper figure 4 Global operations on E registers Gets: read memory into E-register Puts: write E-register to memory Both can operate on single word (32-bit or 64-bit) or vector (8 words) with arbitrary stride Gets and Puts can be highly pipelined due to large number of E- registers Maximum transfer rate between two nodes using vector Gets or Puts is 480 MB/s Portland State University ECE 588/688 Winter 2018 8

Cray T3E Synchronization Atomic memory operations: paper Table 1 Barriers and Eurekas synchronization Barriers allow a set of participating processors to determine when all processors have signaled some event (e.g., reached a certain point in program execution) Eurekas allow a set of processors to determine when any one processor has signaled an event (e.g., completion of parallel search) T3E has 32 barrier/eureka synchronization units (BSU) at each processor Accessed as memory-mapped registers States and events: paper Table 2 & 3 State transition diagrams: paper figure 6 Portland State University ECE 588/688 Winter 2018 9

Monday Announcements J. Tendler et al., "Power4 System Microarchitecture," IBM Journal of Research and Development, 2002 (Review) B. Sinharoy et al., "Power5 System Microarchitecture," IBM Journal of Research and Development, 2005 (Skim) No Class on Wednesday Feb 21. Lecture to be recorded. Project progress report due Feb 26 Portland State University ECE 588/688 Winter 2018 10