BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List)
- The first supercomputer architecture in the Blue Gene project.
- Individual PowerPC 440 processors running at 700 MHz.
- Two processors reside on a single chip; two chips reside on a compute card with 512 MB of memory.
- 16 compute cards are placed on a node board; 32 node boards fit into one cabinet, and there are 64 cabinets (see the arithmetic below).
- 212992 CPUs with a theoretical peak of 596.4 TFLOPS.
- Multiple network topologies are available and can be selected to suit the application.
- High density of processors in a small area: comparatively slow processors, just lots of them, with fast, low-latency interconnects.
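As a quick sanity check on these figures (the 104-cabinet count for the expanded machine is an inference from the arithmetic, not stated on the slide):

    \[
    2\ \text{CPUs/chip} \times 2\ \text{chips/card} \times 16\ \text{cards/board}
      \times 32\ \text{boards/cabinet} = 2048\ \text{CPUs/cabinet}
    \]
    \[
    2048 \times 64 = 131072\ \text{CPUs (original)}, \qquad
    2048 \times 104 = 212992\ \text{CPUs (expanded)}
    \]

So the 64-cabinet description matches the original installation, while the 212992-CPU, 596.4 TFLOPS figures appear to describe the system after its later expansion.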

Architecture Classifications

A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture. They are:

SISD - Single Instruction, Single Data
- Instructions operate sequentially on a single stream of data in a single memory: the classic von Neumann architecture.
- Machines may still consist of multiple processors operating on independent data; these can be considered multiple SISD systems.

SIMD - Single Instruction, Multiple Data
- A single instruction stream, broadcast to all PEs (Processing Elements), acting on multiple data (see the sketch below).
- The most common form of this architecture class is the vector processor, which can deliver results several times faster than a scalar processor.
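To make the SISD/SIMD distinction concrete, here is a minimal C sketch (illustrative, not from the slides): a scalar (SISD) processor executes this loop one element at a time, while a SIMD or vector machine applies each instruction to several elements of x and y at once.

    #include <stddef.h>

    /* One instruction stream, many data elements: a vector processor
       (or a vectorizing compiler) applies each operation to several
       elements of x and y per instruction. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];  /* same operation, different data */
    }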

Architecture Classifications (continued)

MISD - Multiple Instruction, Single Data
- There are no practical implementations of this architecture.

MIMD - Multiple Instruction, Multiple Data
- Independent instruction streams, acting on different (but related) data.
- Note the difference between multiple SISD systems and MIMD: multiple SISD processors run on unrelated data, whereas MIMD streams work on related data within one program.

Architecture Classifications (continued)

- MIMD: MPP, cluster, SMP and NUMA machines
- SISD: machines with a single scalar processor
- SIMD: machines with vector processors

Parallelism in Single Processor Systems

Pipelines
- Perform more operations per clock cycle, reducing the idle time of hardware components.
- It is difficult to keep pipelines full; performance is good when instructions are independent (see the sketch below), and branch prediction helps.

Vector architectures
- One master processor and multiple math co-processors.
- Large memory bandwidth and low-latency access; because of this, no cache is needed.
- Designed to perform operations involving large matrices, commonly encountered in engineering.
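A minimal C sketch (illustrative, not from the slides) of why independent instructions matter for pipelining: the first loop is a serial dependency chain, so each step must wait for the previous one, while the second loop's iterations are independent and can overlap in the pipeline.

    #include <stddef.h>

    /* Dependency chain: each iteration needs the previous value of s,
       so the pipeline stalls between iterations. */
    double chain(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = s * 0.5 + x[i];
        return s;
    }

    /* Independent iterations: the hardware can keep several of these
       operations in flight at once. */
    void independent(double *y, const double *x, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = x[i] * 0.5 + 1.0;
    }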

Multiprocessor Parallelism

Use multiple processors on the same program:
- Divide a task between processors, e.g. by dividing up a data structure so that each processor works on its own data (see the sketch below).
- Processors typically need to communicate: shared memory is one approach, explicit messaging is increasingly common, and distributed shared memory provides a virtual global address space.
- Load balancing is critical for maintaining good performance.
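A minimal sketch of the data-decomposition idea, assuming MPI (the program and variable names are illustrative): each rank owns an equal block of the index range, works on its own data, and communicates only to combine results. Equal-sized blocks are the simplest form of load balancing.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block decomposition: rank r owns indices [lo, hi). */
        int lo = (int)((long)N * rank / size);
        int hi = (int)((long)N * (rank + 1) / size);

        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += 1.0 / (i + 1);          /* work on our own data */

        /* Processors communicate only to combine partial results. */
        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);
        MPI_Finalize();
        return 0;
    }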

Granularity of Parallelism

Granularity is defined as the size of the computations that are performed in parallel. Four types of parallelism, from finest to coarsest granularity:
- Instruction-level parallelism (e.g. pipelining)
- Thread-level parallelism (e.g. running a multi-threaded Java program; see the sketch below)
- Process-level parallelism (e.g. running an MPI job on a cluster)
- Job-level parallelism (e.g. running a batch of independent jobs on a cluster)
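As a concrete instance of thread-level parallelism, here is a minimal C/POSIX-threads sketch (standing in for the slide's Java example): two threads of one process run concurrently and are joined by the parent.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread runs this function concurrently with the others. */
    void *worker(void *arg)
    {
        printf("thread %d running\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int id[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &id[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);  /* wait for both threads */
        return 0;
    }

(Compile with cc -pthread.)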

Dependency and Parallelism

Dependency: if event B must occur after event A, then B is dependent on A. There are two types of dependency:

- Control dependency: waiting for the instruction that controls the execution flow to complete. In IF (X != 0) THEN Y = 1.0/X, the assignment to Y is control-dependent on the test X != 0.
- Data dependency: dependency due to memory access, in three forms (written out in the sketch below):
  - Flow dependency: A = X + Y; B = A + C
  - Anti-dependency: B = A + C; A = X + Y
  - Output dependency: A = 2; X = A + 1; A = 5
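The slide's examples, written out as a C sketch with each kind of dependency marked (variable names follow the slide):

    void dependencies(void)
    {
        float X = 3.0f, Y = 0.0f, A, B, C = 1.0f;

        /* Control dependency: the assignment to Y may only execute
           once the outcome of the test is known. */
        if (X != 0.0f)
            Y = 1.0f / X;

        /* Flow dependency: the second statement reads the A written
           by the first, so the two cannot be reordered. */
        A = X + Y;
        B = A + C;

        /* Anti-dependency: A is read here, then overwritten; swapping
           the statements would change the value of B. */
        B = A + C;
        A = X + Y;

        /* Output dependency: the first and last statements both write
           A; reordering them would leave the wrong final value in A. */
        A = 2.0f;
        X = A + 1.0f;
        A = 5.0f;

        (void)B; (void)X;  /* suppress unused-result warnings */
    }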

Identifying Dependency

Draw a directed acyclic graph (DAG) to identify the dependencies among a sequence of instructions.
- Anti-dependency: a variable appears as a parent in a calculation and then as a child in a later calculation.
- Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation.

[Figure: DAG for the instruction sequence (1) X = A + B; (2) D = X * 17; (3) A = B + C; (4) X = C + E. Statement 2 is flow-dependent on statement 1 through X; statement 3 has an anti-dependency on statement 1 (A is read in 1, written in 3); statement 4 has an anti-dependency on statement 2 (X is read in 2, written in 4) and an output dependency on statement 1 (both write X).]