Top500 Supercomputer list

Top500 Supercomputer list
- Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected.
- Does not consider storage or I/O issues.
- Both custom-designed machines and commodity machines win positions in the list.
- General trend towards commodity machines (COTS: Commodity Off-The-Shelf).
- Connecting a large number of relatively low-performance machines is more rewarding than connecting a small number of high-performance machines. Reference paper: A note on the Zipf distribution of Top500 supercomputers (download from my homepage).
- Performance doubles each year, better than Moore's Law. Moore's Law: performance doubles approximately every 18 months.
- Dominated by the United States.
- UK supercomputers in the latest list (announced 11/2009): Edinburgh: 20; ECMWF: 33 and 34 (last year 22); Southampton: 74; AWE: 172 (last year 77).

BlueGene/L (No. 4 in the Latest Top500 List)
- The first supercomputer architecture in the Blue Gene project.
- Individual PowerPC 440 processors running at 700 MHz.
- Two processors reside in a single chip; two chips reside on a compute card with 512 MB memory; 16 of these compute cards are placed on a node board; 32 node boards fit into one cabinet, and there are 64 cabinets.
- 212,992 CPUs with a theoretical peak of 596.4 TFLOPS.
- Multiple network topologies are available and can be selected depending on the application.
- High density of processors in a small area: comparatively slow processors, just lots of them!
- Fast, low-latency interconnects.
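
As a rough sanity check of the quoted peak (this calculation is not on the slide, and it assumes each PowerPC 440 core can complete 4 floating-point operations per cycle through its dual floating-point unit): 212,992 cores × 700 MHz × 4 FLOPs/cycle ≈ 596.4 TFLOPS, which matches the figure above.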

Architecture Classifications
- A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture.
- SISD (Single Instruction, Single Data): instructions are executed sequentially on a single stream of data in a single memory. The classic Von Neumann architecture. Machines may still consist of multiple processors operating on independent data; these can be considered as multiple SISD systems.
- SIMD (Single Instruction, Multiple Data): a single instruction stream (broadcast to all PEs) acting on multiple data. The most common form of this architecture class is the vector processor, which can deliver results several times faster than a scalar processor. (PE = Processing Element)
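
To make the SIMD idea concrete, here is a minimal sketch in C (the function and array names are illustrative, not from the notes). A vector processor, or a vectorising compiler targeting SIMD units, can execute this element-wise loop as a small number of vector instructions rather than n scalar iterations:

    #include <stddef.h>

    /* The same "add" instruction is applied to many data elements:
     * this is the single-instruction, multiple-data pattern.  A
     * vectorising compiler can turn the loop body into vector loads,
     * a vector add and a vector store. */
    void vec_add(const double *a, const double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }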

Architecture Classifications
- MISD (Multiple Instruction, Single Data): no practical implementations of this architecture.
- MIMD (Multiple Instruction, Multiple Data): independent instruction streams acting on different (but related) data.
- Note the difference between multiple SISD systems and MIMD.

Architecture Classifications
- MIMD: MPP, cluster, SMP and NUMA machines.
- SISD: a machine with a single scalar processor.
- SIMD: a machine with vector processors.

Parallelism in single processor systems
- Pipelines: performing more operations per clock cycle (reduces the idle time of hardware components). It is difficult to keep pipelines full; good performance requires independent instructions. Branch prediction helps.
- Vector architectures: one master processor and multiple maths co-processors. Large memory bandwidth and low-latency access; no cache because of the above. Used to perform operations involving large matrices, commonly encountered in engineering areas.
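
A small illustrative C fragment (not from the notes; the names are made up) showing why independent instructions pipeline well while a dependent chain does not:

    /* In the dependent chain every multiplication needs the result of
     * the previous one, so the pipeline stalls between them.  In the
     * independent version the four multiplications have no
     * dependencies on each other and can occupy different pipeline
     * stages at the same time. */
    double dependent_chain(double x)
    {
        double r = x;
        r = r * r;
        r = r * r;
        r = r * r;
        return r;
    }

    double independent_ops(double a, double b, double c, double d)
    {
        double p1 = a * a;
        double p2 = b * b;
        double p3 = c * c;
        double p4 = d * d;
        return p1 + p2 + p3 + p4;
    }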

Multiprocessor Parallelism
- Use multiple processors on the same program: divide up a task between processors, e.g. dividing up a data structure, with each processor working on its own data.
- Typically the processors need to communicate. Shared memory is one approach; explicit messaging is increasingly common; distributed shared memory (a virtual global address space) is another option.
- Load balancing is critical for maintaining good performance.
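
A minimal sketch in C (hypothetical function and parameter names) of dividing up a data structure: each of nworkers processors is given a contiguous slice of an n-element array, with the remainder spread over the first few workers so the load stays balanced:

    #include <stddef.h>

    /* Compute the [begin, end) range of array elements that worker
     * 'rank' (0 .. nworkers-1) should process. */
    void my_slice(size_t n, int rank, int nworkers,
                  size_t *begin, size_t *end)
    {
        size_t chunk = n / nworkers;
        size_t extra = n % nworkers;      /* leftover elements */
        size_t r     = (size_t)rank;

        *begin = r * chunk + (r < extra ? r : extra);
        *end   = *begin + chunk + (r < extra ? 1 : 0);
    }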

Granularity of Parallelism
- Defined as the size of the computations that are being performed in parallel.
- Four types of parallelism (in order of granularity size):
  - Instruction-level parallelism (e.g. pipelining)
  - Thread-level parallelism (e.g. running a multi-threaded Java program)
  - Process-level parallelism (e.g. running an MPI job on a cluster)
  - Job-level parallelism (e.g. running a batch of independent jobs on a cluster)

Dependency and Parallelism
- Dependency: if event B must occur after event A, then B is dependent on A.
- Two types of dependency:
  - Control dependency: waiting for the instruction which controls the execution flow to complete. Example: IF (X != 0) THEN Y = 1.0/X; Y has a control dependency on the test X != 0.
  - Data dependency: dependency due to memory access:
    - Flow dependency: A = X + Y; B = A + C;
    - Anti-dependency: B = A + C; A = X + Y;
    - Output dependency: A = 2; X = A + 1; A = 5;
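
The same three data-dependency cases written out as a small C fragment (the variable values are arbitrary), with comments naming each dependency:

    void dependency_examples(void)
    {
        double X = 3.0, Y = 4.0, C = 1.0, A, B;

        /* Flow dependency: the second statement reads A, which the
         * first statement has just written. */
        A = X + Y;
        B = A + C;

        /* Anti-dependency: the first statement reads A, the second
         * statement overwrites it, so they cannot be reordered. */
        B = A + C;
        A = X + Y;

        /* Output dependency: the first and last statements both write
         * A, so their order determines A's final value. */
        A = 2;
        X = A + 1;
        A = 5;

        (void)A; (void)B; (void)X;   /* values are only illustrative */
    }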

Identifying Dependency
- Draw a Directed Acyclic Graph (DAG) to identify the dependencies among a sequence of instructions.
- Anti-dependency: a variable appears as a parent in a calculation and then as a child in a later calculation.
- Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation.
- Example sequence:
    1: X = A + B
    2: D = X * 17
    3: A = B + C
    4: X = C + E
- [DAG figure: nodes for A, B, C, E, X, D and 17; instruction 2 flow-depends on 1 through X, instruction 3 has an anti-dependency on 1 through A, and instruction 4 has an output dependency on 1 and an anti-dependency on 2, both through X.]

High Performance Computing Course Notes 2009-2010: Models of Parallel Programming

Models of Parallel Programming
Different approaches for programming parallel and distributed computing systems include:
- Dedicated languages designed specifically for parallel computers
- Smart compilers, which automatically parallelise sequential code
- Data parallelism: multiple processors run the same operation on different elements of a data structure
- Shared memory: processors share a common address space
- Message passing: the memory in each processor has its own address space
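
As a concrete sketch of the message-passing model (MPI is not introduced on this slide, so this is only an illustration reusing the MPI example mentioned under process-level parallelism): each process owns its own address space, and data moves only through explicit send and receive calls.

    #include <mpi.h>
    #include <stdio.h>

    /* Run with at least two processes, e.g. "mpirun -np 2 ./a.out".
     * Each process has its own private copy of 'value'; rank 0 must
     * explicitly send it for rank 1 to see it. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }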

Specially Designed Language

Occam
- Occam is a concurrent programming language.
- Occam is an executable implementation of Communicating Sequential Processes (CSP) theory. CSP is a mathematical theory for describing the interactions of tasks in a concurrent system.
- It is therefore possible, in principle, to prove that a program written in Occam is correct.
- Occam is specially designed to make full use of the architectural characteristics of computer systems built from transputer ("transistor computer") chips, developed by INMOS.
- The transputer was the first microprocessor specially designed for parallel computing. A number of transputer chips are wired together to form a complete computer system (no bus, RAM or OS).
- In Occam, processes communicate through channels. A channel can be regarded as a message-passing mechanism within a computer.

Occam
- Sequential execution:
    SEQ
      x := x + 1
      y := x * x
- Parallel execution:
    PAR
      p()
      q()
- Communication between processes through a channel (here Channel_var): the sending process outputs with ! and the receiving process inputs with ?
    Channel_var ! x    -- ProcessA sends the value of x on the channel
    Channel_var ? y    -- ProcessB receives from the channel into y

Occam
- The Occam language was not popular:
  - Poor portability
  - The transputer chip was very expensive
- For more information:
  - Occam 2 reference manual: www.wotug.org/occam/documentation/oc21refman.pdf
  - Occam archive: http://vl.fmnet.info/occam/
  - Transputer: http://en.wikipedia.org/wiki/inmos_transputer

Dedicated languages
- In general, a dedicated language is going to do a better job:
  1. It is designed with the hardware architecture in mind
  2. The structure of the language reflects the nature of the parallelism
- However:
  1. Niche languages are not generally popular
  2. It is hard to port existing code
- It is much better to modify and extend an existing language to include parallelism, because:
  1. There is a bigger audience
  2. Programmers only need to learn new constructs or an API, not a new language
  3. Porting is a lot easier