Roadrunner. By Diana Lleva, Julissa Campos, Justina Tandar


Transcription:

Roadrunner. By Diana Lleva, Julissa Campos, Justina Tandar

Overview
- Roadrunner background
- On-chip interconnect
- Number of cores
- Memory hierarchy
- Pipeline organization
- Multithreading organization

Roadrunner Background
- Fastest supercomputer in the world at the time of this presentation; completed in 2009
- Three-phase project:
  1. Base system: 76 teraflops
  2. Advanced algorithms to evaluate the Cell processor
  3. Roadrunner: 1.3 petaflops, clustered Cell processors plus an Opteron cluster

Roadrunner at a Glance
- Structure: 6,912 AMD Opteron 2210 processors and 12,960 IBM PowerXCell 8i processors
- Speed: 1.71 petaflops (peak)
- Rank: 1 (Top500) as of June 2008
- Cost: $133M
- Memory: 103.6 TiB
- Power: 2.35 MW

On-Chip Interconnect [diagram slides: Cell/B.E. block diagram, repeated with each component highlighted in turn]

Power Processor Element (PPE)
- 1 PPE per chip
- 1 IBM VMX unit
- 32 KB L1 cache
- 512 KB L2 cache
- 2-way simultaneous multithreading

Synergistic Processing Elements (SPE)
- 8 SPEs per chip
- 128-bit SIMD ISA
- 128 x 128-bit register file
- 256 KB local store
- 1 Memory Flow Controller (MFC)
- Isolation mode (security)

Element Interconnect Bus (EIB)
- 1 bus per chip
- Bandwidth: 96 B/cycle
- Four 16 B data rings for internal communication
- Each data port supports 25.6 GB/s in each direction
- Command bus supports 102.4 GB/s of coherent commands and 307.2 GB/s between units
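As a sanity check (assuming the standard 3.2 GHz Cell/B.E. core clock, which this slide does not state): 96 B/cycle x 3.2 GHz = 307.2 GB/s, matching the between-units figure, and the 25.6 GB/s per-port figure matches 16 B/cycle at the half-speed (1.6 GHz) bus clock.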

EIB [diagram slide: example of the EIB running 8 transactions concurrently]

System Memory Interface
- Memory controller and Rambus XDR DRAM interface
- Bandwidth: 16 B/cycle
- Speed: 25.6 GB/s

Input/Output Interface
- I/O controller and Rambus FlexIO
- Bandwidth: 2 x (16 B/cycle)

Number of Cores
- 6,912 dual-core AMD Opteron processors and 12,960 IBM Cell eDP (PowerXCell 8i) accelerators
- Not including the cores in the operations and communications nodes: 6,120 Opterons (2 cores each) + 12,240 PowerXCell 8i chips (9 cores each) = 122,400 cores
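Spelling out the slide's count: 6,120 x 2 = 12,240 Opteron cores, and 12,240 x 9 = 110,160 PowerXCell 8i cores (1 PPE + 8 SPEs per chip); 12,240 + 110,160 = 122,400 cores in the compute partition.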

Pipeline Organization
- 6 stages
- In-order execution
- Dual-issue pipeline (2 pipes); drops to single issue when instructions must be swapped between pipes
- 9 units per pipeline

Pipeline Organization [diagram slide: SPE block diagram showing the Floating-Point, Fixed-Point, Permute, Load-Store, Branch, and Channel units; the 256 KB single-port SRAM local store with 128 B read and 128 B write ports; result forwarding and staging; the register file; the Instruction Issue Unit / Instruction Line Buffer; and the DMA unit, with datapath widths of 8, 16, and 64 B/cycle]

Pipeline Organization [diagram slides: the SPE pipeline diagram repeated with each unit highlighted in turn: Permute Unit; Load-Store Unit; Channel Unit and Branch Unit; Floating-Point Unit and Fixed-Point Unit (note: these are not the only locations for these units); Instruction Issue Unit; Register File Unit; followed by a more detailed look into the pipeline]

SPE & PPE [diagram slide: Cell/B.E.]

SPE & PPE
- This design was driven by memory-latency considerations
- The PPE has 32 KB L1 instruction and data caches
- The PPE has a 512 KB unified (instruction and data) L2 cache
- Each SPE's DMA controller can keep up to 16 DMA transfers in flight simultaneously

PPE vs. SPE
- PPE: accesses memory like a conventional processor; uses load and store instructions to move data between main storage and registers
- SPE: uses direct memory access (DMA) commands to move data between main storage and a private local memory (see the sketch below)
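A minimal SPU-side sketch of such a DMA transfer, assuming the Cell SDK's spu_mfcio.h; the effective address ea and the 16 KB chunk size are illustrative:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 16384  /* 16 KB: the maximum size of a single DMA transfer */
    static volatile char buf[CHUNK] __attribute__((aligned(128)));

    void fetch_chunk(uint64_t ea)
    {
        unsigned int tag = 0;               /* DMA tag group 0 */
        mfc_get(buf, ea, CHUNK, tag, 0, 0); /* main storage -> local store */
        mfc_write_tag_mask(1 << tag);       /* select which tag group to wait on */
        mfc_read_tag_status_all();          /* block until the transfer completes */
        /* buf now holds the data in the local store */
    }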

PPE vs. SPE
- PPE: faster at task switching
- SPE: more adept at compute-intensive tasks
- Specialization is a factor in the improved performance and power efficiency

Double Buffering: DMA commands are processed in parallel with software execution, so an SPE can compute on one buffer while the MFC fills the other (sketched below).
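A minimal double-buffering sketch under the same assumptions as above (Cell SDK spu_mfcio.h; process() and the chunk layout are hypothetical):

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 16384
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(volatile char *data, int len);  /* hypothetical consumer */

    void stream(uint64_t ea, int nchunks)
    {
        int i, cur = 0;
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);   /* prefetch the first chunk (tag 0) */
        for (i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)               /* kick off the next transfer early */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);      /* wait only on the current buffer */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);          /* compute overlaps the DMA above */
            cur = nxt;
        }
    }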

SIMD Organization: the IBM PowerXCell 8i has 8 Synergistic Processor Elements (SPEs) plus 1 PowerPC Processing Element (PPE)

Synergistic Processing Elements
- Optimized for compute-intensive code using SIMD
- Many SIMD registers (128 x 128-bit)
- Operations per cycle: sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers (see the sketch below)
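For example (a sketch assuming the Cell SDK's SPU intrinsics from spu_intrinsics.h), a single spu_madd operates on all four single-precision lanes at once:

    #include <spu_intrinsics.h>

    /* Lane-wise a*x + y: four single-precision multiply-adds per instruction */
    vector float saxpy4(vector float a, vector float x, vector float y)
    {
        return spu_madd(a, x, y);
    }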

PowerPC Processing Element
- Optimized for control-intensive code
- VMX SIMD unit with 128-bit SIMD registers
- Operations per cycle: sixteen 8-bit integers, eight 16-bit integers, or four 32-bit integers (see the sketch below)
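The PPE-side equivalent uses the VMX (AltiVec) intrinsics from altivec.h (a sketch; compile with -maltivec):

    #include <altivec.h>

    /* Four 32-bit integer additions per instruction */
    vector signed int add4(vector signed int a, vector signed int b)
    {
        return vec_add(a, b);
    }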

SIMD in VMX and SPE: 128-bit-wide datapath, 128-bit-wide registers

TriBlade
- One IBM LS21 Opteron blade
- Two IBM BladeCenter QS22 Cell/B.E. blades
- One blade that houses the communications fabric for the compute node

TriBlade [diagram slides]

TriBlade
- One Cell/B.E. chip for each Opteron core
- Hierarchy: the Opteron processors establish a master-subordinate relationship with the Cell/B.E. processors
- Each TriBlade node is capable of 400 gigaflops of double-precision compute power
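That figure is consistent with the blade count (assuming the commonly cited 102.4 double-precision gigaflops peak per PowerXCell 8i, which the slide does not state): two QS22 blades x 2 PowerXCell 8i chips x 102.4 GF = 409.6 GF, roughly the 400 gigaflops per node quoted here.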

Multithreading
- Threads alternate fetch and dispatch cycles
- To run multithreaded programs, multiple SPEs run simultaneously
- Usually these programs create and organize the SPE threads as needed (see the sketch below)
- Each SPE supports one thread
- The PPE supports two hardware threads without the program having to create them
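A minimal PPE-side sketch of creating SPE threads with the Cell SDK's libspe2; the embedded SPU executable spu_kernel is a hypothetical name:

    #include <pthread.h>
    #include <libspe2.h>

    #define NUM_SPES 8
    extern spe_program_handle_t spu_kernel;  /* embedded SPU program (assumed name) */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL); /* blocks until the SPE stops */
        return NULL;
    }

    int main(void)
    {
        pthread_t thr[NUM_SPES];
        spe_context_ptr_t ctx[NUM_SPES];
        int i;

        for (i = 0; i < NUM_SPES; i++) {
            ctx[i] = spe_context_create(0, NULL);            /* one context per SPE */
            spe_program_load(ctx[i], &spu_kernel);
            pthread_create(&thr[i], NULL, run_spe, ctx[i]);  /* one PPE thread per SPE */
        }
        for (i = 0; i < NUM_SPES; i++) {
            pthread_join(thr[i], NULL);                      /* wait for each SPE to finish */
            spe_context_destroy(ctx[i]);
        }
        return 0;
    }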

Multithreading [diagram slide]